-
Notifications
You must be signed in to change notification settings - Fork 743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add subworkflow for functional enrichment analysis #7254
base: master
Are you sure you want to change the base?
Conversation
…ow locally implemented in differentialabundance. The original code was the result of the effort from multiple collaborators: Co-authored-by: Björn Langer <[email protected]> Co-authored-by: Cristina Araiz <[email protected]> Co-authored-by: Thomas Tams <[email protected]> Co-authored-by: Breeshey Roskams-Hieter <[email protected]>
…profiler2 that works when empty channels are given as Channel.of([], []), but stop working when empty channels are given as null, as combine/join methods cannot work on null
…work. Currently it is not running through GSEA modules -> check input channel
…ults as the module itself too.
@pinin4fjords @grst @nschcolnicov @mirpedrol @JoseEspinosa Hello! Here you have the functional analysis subworkflow! Let me know what you think, specially in terms of optional input/output that would be needed given the fact that different functional analysis tools might need different inputs and produce different outputs. On the other hand, the gprofiler2 test always fails in the CI here when using conda, as it produces png files with different md5. However, when I run it using Gitpod with conda, it always passes. Any idea of why? @pinin4fjords @nschcolnicov |
Thanks @suzannejin ! I'll have a look closely soon. PNGs are not snapshotable, they change across systems, and you've noted. The best you can do with those is check the name. |
// here we define the input channel for the GSEA section | ||
|
||
def criteria = multiMapCriteria { meta_input, input, meta_exp, samplesheet, featuresheet, features_id, features_symbol, meta_contrasts, variable, reference, target -> | ||
def analysis_method = meta_input.method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed if we merge meta_input
which already contains the method
key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right! this was left from a previous implementation that need it explicit, but I can remove it now
ch_gene_sets.collect(), | ||
ch_background.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using collect()
correct? won't it put everything (meta + files) into one single list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason for using .collect() is that when ch_input has more than one element, but we only provide one ch_gene_sets elements, the module would still run as many times as ch_input elements.
I assume here that we are gonna only provide one gene_sets, this is likely the behavior for gprofiler2 and grea. For GSEA, the differentialabundance pipeline can take more than one gene_sets. Is there any reason for this difference? And how would you deal with this? @pinin4fjords
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just replaced collect()
by combining input with gene_sets before hand. This should work for either scenario :)
subworkflows/nf-core/differential_functional_enrichment/tests/all.config
Show resolved
Hide resolved
ch_input // [ meta_input, input file, method to run ] | ||
|
||
// gene sets and background | ||
ch_gene_sets // [ meta_gmt, gmt file ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this necessarily need to be a GMT file?
E.g. decoupler uses weighted gene sets for PROGENy and collecTRI analyses that are typically provided as a long-form data frame. If we included deconvolution as functional analysis tool, it would typically use a signature matrix.
It really depends a bit on the scope of this subworkflow. But if the plan is to support a wide range of functional analysis tools as suggested in nf-core/differentialabundance#367, it would be good to keep this generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but I'm always wary of premature over-engineering. I don't object to gmt in the first instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically my suggestion comes down to making the gene sets + background method-specific. Already the current workflow logic doesn't really care about whether it's a gmt file or not. It's only specified in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is possible to standardize the gene set input format for all methods, and then each module deals with the reformatting to the proper format specifically needed for the method?
Otherwise I would imagine it become confusing from the pipeline's user perspective to have to provide the input gene set with certain format depending on the method chosen, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The "long-form dataframe" format used by decoupler is quite universal. It can cover signature matrices, weighted and signed gene sets.
- This is also the format that can be obtained from omnipathdb via API. Omnipathdb contains most of the commonly used signatures, such as MSigDB, GO, Dorothea, Progeny. Like that users wouldn't need to obtain the genesets themselves, but just specify the name.
- I agree that a common format makes sense, but I might still want to couple certain signatures to certain methods and not necessarily run all-vs-all. E.g. With PROGENy signatures, I'd typically use the recommended MLM algorithm in decoupler, while with MSigDB signatures, I'd rather use GSEA.
- Not entirely sure yet if deconvolution is in the scope of this subworkflow, but here the methods are often shipped together with a signature matrix, so they wouldn't require any input signature at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. It would be really nice to include omnipathdb, then the coupling between signature and method can be automatized based on name.
For now, we could have a method specific gene sets channel just as input channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not thrilled by the prospect of a multiplicity of gene_set inputs per method.
@suzannejin Maybe the gene sets should actually be part of ch_input. That way, there could be method-specific gene set files in that channel, and we wouldn't need multiple input channels
ch_background // [ meta_background, background file ] | ||
|
||
// other - for the moment these files are only needed for GSEA | ||
ch_contrasts // [ meta_contrast, contrast_variable, reference, target ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) When providing results from differential expression analysis, a contrast would not be needed.
(2) When providing gene expression as input, differential testing is necessary that is not unlike DE analysis (it needs a model and a contrast definition).
IMO it would make sense to keep differential testing entirely out of this workflow. In case (1), it's not needed, and in case (2), a sample x signature matrix is produced. This matrix could just be fed into a differential analysis workflow (e.g. limma) again.
Like that we can keep the complexity of this subworkflow low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware that the gsea module does support providing a contrast directly when using gene expression data. However, I believe that there's no point in using this mode.
In this case, GSEA anyway just computes a metric (signal2noise, t-statistic, ...) based on these variables (see docs), so we can as well provide a fold change or DE-statistic directly, while having the advantage that we can provide a full model definition including covariates to the DE method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware that the gsea module does support providing a contrast directly when using gene expression data. However, I believe that there's no point in using this mode.
Fair point, but it's what diff. ab. does right now. I'd recognised the possbility of a switch, but hadn't got round to it: nf-core/differentialabundance#36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point that we could model the current matrix-driven GSEA as part of the differential subworkflow, alongside LIMMA et al though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted the current subworkflow to be able to produce the exact behaviour of how the modules are used in the DA pipeline, hence GSEA is taking gene expression data instead of DE. Not sure why this is mode is chosen for the pipeline, maybe @pinin4fjords can step in here?
But yes, I do agree that it would be conceptually cleaner if the subworkflow just takes in DE output, and so for GSEA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of taking this chance of streamlining the workflow now that we are anyway changing quite a few things. We can help with implementing a module for preranked GSEA if required1.
(alternatively, the decoupler module is kind of ready, and it comes with a very fast GSEA implementation -- it should generate the same results in terms of scores and p-values, but it doesn't produce all outputs such as the "leading edge" plots).
Footnotes
-
In a few weeks. Waiting for the contract extension with our external developers to be signed. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to be a pain, but I do like those leading edge plots, they are useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against a preranked GSEA module. But what I really would like to get rid of is the contrast specification in this subworkflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually like to start integrating the subworkflows in the current DA pipeline hopefully next week.
If the switch to preranked GSEA would take some time, maybe we could first agree on the current subworkflow version? I think it could serve as a nice starting point, and you are welcome to add modifications above it afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments, it's going the right way.
IMO having GSEA in here is fine, I can imagine other methods in here in future might use them, they can be optional, and 'pre-ranked' methods can ignore them.
I do think the gene_sets should be part of the iterable input if they're going to be method-specific, to save us adding method-wise get set channels to the interface.
- meta_input: | ||
type: map | ||
description: Metadata map | ||
- input: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on what is meant by 'input' here please?
ch_input // [ meta_input, input file, method to run ] | ||
|
||
// gene sets and background | ||
ch_gene_sets // [ meta_gmt, gmt file ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not thrilled by the prospect of a multiplicity of gene_set inputs per method.
@suzannejin Maybe the gene sets should actually be part of ch_input. That way, there could be method-specific gene set files in that channel, and we wouldn't need multiple input channels
|
||
// gsea-specific outputs | ||
gsea_report = GSEA_GSEA.out.report_tsvs_ref | ||
.join(GSEA_GSEA.out.report_tsvs_target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange indent
|
||
// tool versions | ||
versions = ch_versions | ||
.mix(GPROFILER2_GOST.out.versions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange indents
.mix(CUSTOM_TABULARTOGSEACLS.out.versions) | ||
.mix(CUSTOM_TABULARTOGSEACHIP.out.versions) | ||
.mix(GSEA_GSEA.out.versions) | ||
.mix(PROPR_GREA.out.versions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there no emissions?
PR checklist
Closes #XXX
versions.yml
file.label
nf-core modules test <MODULE> --profile docker
nf-core modules test <MODULE> --profile singularity
nf-core modules test <MODULE> --profile conda
nf-core subworkflows test <SUBWORKFLOW> --profile docker
nf-core subworkflows test <SUBWORKFLOW> --profile singularity
nf-core subworkflows test <SUBWORKFLOW> --profile conda