Merge branch 'dev' into tools_cutoff

nf-core · Sep 25, 2023 · c23a6ca
2 parents 9abda97 + e3be022

Showing 25 changed files with 567 additions and 896 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,7 +3,7 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v2.4.0dev
+## v2.4.0 - [2023/09/22]
 
 ### Added
 
@@ -12,7 +12,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Use institutional configs by default [#381](https://github.com/nf-core/rnafusion/pull/381)
 - Remove redundant indexing in starfusion and qc workflows [#387](https://github.com/nf-core/rnafusion/pull/387)
 - Output bai files in same directory as bam files [#387](https://github.com/nf-core/rnafusion/pull/387)
-- Removed `--fusioninspector_filter` and `--fusionreport_filter` in favor of `--tools_cutoff` (default = 1, no filters applied) [#389](https://github.com/nf-core/rnafusion/pull/389)
+- Update and review documentation [#396](https://github.com/nf-core/rnafusion/pull/396)
+- Update picard container for `PICARD_COLLECTRNASEQMETRICS` to 3.0.0 [#395](https://github.com/nf-core/rnafusion/pull/395)
+- Renamed output files [#395](https://github.com/nf-core/rnafusion/pull/395)
+  - `Arriba` visualisation pdf from meta.id to meta.id_combined_fusions_arriba_visualisation
+  - cram file from output bam of `STAR_FOR_ARRIBA`: meta.id to meta.id_star_for_arriba
+  - cram file from output bam of `STAR_FOR_STARFUSION`: meta.id to meta.id.star_for_starfusion.Aligned.sortedByCoord.out
+  - `fusion-report` index.html file to meta.id_fusionreport_index.html
+  - meta.id.vcf output from `MEGAFUSION` to meta.id_fusion_data.vcf
 
 ### Fixed
 
@@ -21,10 +28,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Provide gene count file by default when running STAR_FOR_STARFUSION [#385](https://github.com/nf-core/rnafusion/pull/385)
 - Fix fusion-report issue with MACOXS directories [#386](https://github.com/nf-core/rnafusion/pull/386)
 - The fusion lists is updated to contain two branches, one in case no fusions are detected and one for if fusions are detected, that will be used to feed to fusioninspector, megafusion, arriba visualisation [#388](https://github.com/nf-core/rnafusion/pull/388)
+- Update fusionreport to 2.1.5p4 to fix 403 error in downloading databases [#403](https://github.com/nf-core/rnafusion/pull/403)
 
 ### Removed
 
-## v2.3.0 = [2022/04/24]
+- `samtools sort` and `samtools index` for `arriba` workflow were dispensable and were removed [#395](https://github.com/nf-core/rnafusion/pull/395)
+- Removed trimmed fastqc report from multiqc [#394](https://github.com/nf-core/rnafusion/pull/394)
+
+## v2.3.0 - [2023/04/24]
 
 ### Added
 
@@ -47,7 +58,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Removed
 
-## v2.2.0 - [2022/03/13]
+## v2.2.0 - [2023/03/13]
 
 ### Added
 
@@ -84,7 +95,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - FUSIONINSPECTOR_DEV process as the option fusioninspector_limitSjdbInsertNsj is part of the main starfusion release
 
-## [2.1.0] nfcore/rnafusion - 2022/07/12
+## v2.1.0 - [2022/07/12]
 
 ### Added
 
@@ -118,7 +129,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Removed
 
-## [2.0.0] nfcore/rnafusion - 2022/05/19
+## v2.0.0 - [2022/05/19]
 
 Update to DSL2 and newer software/reference versions
 
@@ -260,7 +271,7 @@ to
 - GRCh37 support. Subdirectory with params.genome are removed
 - Running with conda
 
-## v1.3.0dev nfcore/rnafusion - 2020/07/15
+## v1.3.0 - [2020/07/15]
 
 - Using official STAR-Fusion container [#160](https://github.com/nf-core/rnafusion/issues/160)
 
@@ -291,7 +302,7 @@ to
 
 ---
 
-## [1.1.0] nfcore/rnafusion - 2020/02/10
+## v1.1.0 - [2020/02/10]
 
 - Fusion gene detection tools:
   - `Arriba v1.1.0`
@@ -339,7 +350,7 @@ to
 
 ---
 
-## [1.0.2] nfcore/rnafusion - 2019/05/13
+## v1.0.2 - [2019/05/13]
 
 ### Changed
 
@@ -353,7 +364,7 @@ to
 
 ---
 
-## [1.0.1] nfcore/rnafusion - 2019/04/06
+## v1.0.1 - [2019/04/06]
 
 ### Added
 
@@ -381,7 +392,7 @@ to
 
 ---
 
-## [1.0] nfcore/rnafusion - 2018/02/14
+## v1.0 - [2018/02/14]
 
 Version 1.0 marks the first production release of this pipeline under the nf-core flag.
 The pipeline includes additional help scripts to download references for fusion tools and Singularity images.
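
The hunks above normalise every release heading in `CHANGELOG.md` to the form `## vX.Y[.Z] - [YYYY/MM/DD]`. A minimal sketch of how that convention could be checked is given below. The function name `nonconforming_headings` is hypothetical, and the sketch assumes that every level-2 heading in the changelog is a release heading.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): list level-2 headings in a
# changelog that do not follow the "## vX.Y[.Z] - [YYYY/MM/DD]" form.
nonconforming_headings() {
    local changelog="$1"
    # Select level-2 headings, then drop those matching the expected form.
    # "|| true" keeps the exit status at 0 when nothing is left to report.
    grep -E '^## ' "$changelog" \
        | grep -Ev '^## v[0-9]+\.[0-9]+(\.[0-9]+)? - \[[0-9]{4}/[0-9]{2}/[0-9]{2}\]$' \
        || true
}
```

Calling `nonconforming_headings CHANGELOG.md` prints the offending headings, one per line. The pattern only checks the shape of a heading: a development heading such as `## v2.4.0dev` would be reported, while a wrong but well-formed date (like the year corrections in this commit) would not.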

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,34 +10,57 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [Arriba](https://github.com/suhrig/arriba)
 
-  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
+  > Uhrig S, Ellermann J, Walther T, Burkhardt P, Fröhlich M, Hutter B, Toprak UH, Neumann O, Stenzinger A, Scholl C, Fröhling S, Brors B. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Research. 2021 Mar 31;448-460. doi: 10.1101/gr.257246.119. Epub 2021 Jan 13. PubMed PMID: 33441414.
 
-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+- [BEDOPS](https://bedops.readthedocs.io/en/latest/index.html) - convert2bed
 
-  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+  > Neph S, Scott Kuehn M, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, Sandstrom R, Humbert R, Stamatoyannopoulos JA. BEDOPS: high-performance genomic feature operations. Bioinformatics. 2012 May, 28 (14): 1919-1920. doi: 10.1093/bioinformatics/bts277, PubMed PMID: 22576172.
 
-- [Arriba](https://github.com/suhrig/arriba)
+- [FastP](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234)
 
-  > Uhrig S, Ellermann J, Walther T, Burkhardt P, Fröhlich M, Hutter B, Toprak UH, Neumann O, Stenzinger A, Scholl C, Fröhling S, Brors B. Accurate and efficient detection of gene fusions from RNA sequencing data.
-  > Genome Research. 2021 Mar 31;448-460. doi: 10.1101/gr.257246.119. Epub 2021 Jan 13. PubMed PMID: 33441414; PubMed Central PMCID: PMC7919457.
+  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sept 34:17 (i884–i890), doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086. PubMed Central PMCID: PMC6129281
+
+- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+
+  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
 
 - [FusionCatcher](https://github.com/ndaniel/fusioncatcher)
 
   > Nicorici D, Satalan M, Edgren H, Kangaspeska S, Murumagi A, Kallioniemi O, Virtanen S, Kilkku O. FusionCatcher – a tool for finding somatic fusion genes in paired-end RNA-sequencing data. BioRxiv, 2014 Nov. doi: 10.1101/011650.
 
+- [FusionInspector](https://github.com/FusionInspector/FusionInspector)
+
+  > Haas BJ, Dobin A, Ghandi M, Van Arsdale A, Tickle T, Robinson JT, Gillani R, Kasif S, Regev A. Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector. Cell Reports Methods. 2023 May 3:5, doi: 10.1016/j.crmeth.2023.100467, PMID: 37323575
+
 - [Fusion-report](https://github.com/matq007/fusion-report)
 
   > Proks M, Genomic Profiling of a Comprehensive Nation-wide Collection of Childhood Solid Tumors, Master Thesis, Supervisors: Grøntved L, Díaz de Ståhl T, Nistér M, Ewels P, Garcia MU, Juhos S, University of Southern Denmark, 2019, unpublished.
 
+- [GATK4](https://gatk.broadinstitute.org/hc/en-us)
+
+  > Van der Auwera GA. Somatic variation discovery with GATK4. Proceedings of the American Association for Cancer Research Annual Meeting 2017. 2017 Apr 1-5. Cancer Res 2017;77(13 Suppl) doi:10.1158/1538-7445.AM2017-3590
+
 - [Kallisto](https://pachterlab.github.io/kallisto/)
 
   > Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 2016 Apr. 34, 525–527. doi:10.1038/nbt.3519. PMID: 27043002.
 
+- [MegaFusion](https://github.com/J35P312/MegaFusion)
+
+- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+
+  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+
+- [picard-tools](http://broadinstitute.github.io/picard)
+
 - [Pizzly](https://github.com/pmelsted/pizzly)
   Melsted P, Hateley S, Joseph IC, Pimentel H, Bray N, Pachter L. Fusion detection and quantification by pseudoalignment. BioRxiv, 2017 Jul. doi: 10.1101/166322.
 
+- [Qualimap 2](https://pubmed.ncbi.nlm.nih.gov/26428292/)
+
+  > Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data Bioinformatics. 2016 Jan 15;32(2):292-4. doi: 10.1093/bioinformatics/btv566. Epub 2015 Oct 1. PubMed PMID: 26428292; PubMed Central PMCID: PMC4708105.
+
 - [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)
 
   > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

diff --git a/README.md b/README.md
@@ -12,81 +12,62 @@
 
 ## Introduction
 
-**nf-core/rnafusion** is a bioinformatics best-practice analysis pipeline for RNA sequencing analysis pipeline with curated list of tools for detecting and visualizing fusion genes.
+**nf-core/rnafusion** is a bioinformatics best-practice analysis pipeline for RNA sequencing consisting of several tools designed for detecting and visualizing fusion genes. Results from up to 5 fusion caller tools are created and aggregated, most notably in a PDF visualisation document, a VCF data collection file, and HTML and TSV reports.
 
-The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
-
-> **IMPORTANT: conda is not supported currently.** Run with singularity or docker.
-
-> GRCh38 is the only supported reference
-
-| Tool                                                      | Version  |
-| --------------------------------------------------------- | :------: |
-| [Arriba](https://github.com/suhrig/arriba)                | `2.3.0`  |
-| [FusionCatcher](https://github.com/ndaniel/fusioncatcher) |  `1.33`  |
-| [Pizzly](https://github.com/pmelsted/pizzly)              | `0.37.3` |
-| [Squid](https://github.com/Kingsford-Group/squid)         |  `1.5`   |
-| [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) | `1.10.1` |
-| [StringTie](https://github.com/gpertea/stringtie)         | `2.2.1`  |
-
-> Single-end reads are to be use as last-resort. Paired-end reads are recommended. FusionCatcher cannot be used with single-end reads shorter than 130 bp.
-
-On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/rnafusion/results).
+On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/rnafusion/results).
 
 In rnafusion the full-sized test includes reference building and fusion detection. The test dataset is taken from [here](https://github.com/nf-core/test-datasets/tree/rnafusion/testdata/human).
 
 ## Pipeline summary
 
 ![nf-core/rnafusion metro map](docs/images/nf-core-rnafusion_metro_map.png)
 
-#### Build references
+### Build references
 
-`--build_references` triggers a parallel workflow to build all references
+`--build_references` triggers a parallel workflow to build references, which is a prerequisite to running the pipeline:
 
 1. Download ensembl fasta and gtf files
-2. Create STAR index
-3. Download arriba references
-4. Download fusioncatcher references
-5. Download pizzly references (kallisto index)
-6. Download and build STAR-fusion references
-7. Download fusion-report DBs
+2. Create [STAR](https://github.com/alexdobin/STAR) index
+3. Download [Arriba](https://github.com/suhrig/arriba) references
+4. Download [FusionCatcher](https://github.com/ndaniel/fusioncatcher) references
+5. Download [Pizzly](https://github.com/pmelsted/pizzly) references ([kallisto](https://pachterlab.github.io/kallisto/manual) index)
+6. Download and build [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) references
+7. Download [Fusion-report](https://github.com/Clinical-Genomics/fusion-report) DBs
 
 #### Main workflow
 
 1. Input samplesheet check
-2. Concatenate fastq files per sample
-3. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-4. Arriba subworkflow
+2. Concatenate fastq files per sample ([cat](http://www.linfo.org/cat.html))
+3. Reads quality control ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
+4. Optional trimming with [fastp](https://github.com/OpenGene/fastp)
+5. Arriba subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
-   - [Samtool](https://github.com/samtools/samtools) sort
-   - [Samtool](https://github.com/samtools/samtools) index
    - [Arriba](https://github.com/suhrig/arriba) fusion detection
-5. Pizzly subworkflow
+6. Pizzly subworkflow
    - [Kallisto](https://pachterlab.github.io/kallisto/) quantification
    - [Pizzly](https://github.com/pmelsted/pizzly) fusion detection
-6. Squid subworkflow
+7. Squid subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
    - [Samtools view](http://www.htslib.org/): convert sam output from STAR to bam
    - [Samtools sort](http://www.htslib.org/): bam output from STAR
    - [SQUID](https://github.com/Kingsford-Group/squid) fusion detection
    - [SQUID](https://github.com/Kingsford-Group/squid) annotate
-7. STAR-fusion subworkflow
+8. STAR-fusion subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
    - [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) fusion detection
-8. Fusioncatcher subworkflow
+9. Fusioncatcher subworkflow
    - [FusionCatcher](https://github.com/ndaniel/fusioncatcher) fusion detection
-9. Fusion-report subworkflow
-   - Merge all fusions detected by the different tools
-   - [Fusion-report](https://github.com/matq007/fusion-report)
-10. FusionInspector subworkflow
+10. StringTie subworkflow
+    - [StringTie](https://ccb.jhu.edu/software/stringtie/)
+11. Fusion-report
+    - Merge all fusions detected by the selected tools with [Fusion-report](https://github.com/Clinical-Genomics/fusion-report)
+12. Post-processing and analysis of data
     - [FusionInspector](https://github.com/FusionInspector/FusionInspector)
     - [Arriba](https://github.com/suhrig/arriba) visualisation
-11. Stringtie subworkflow
-    - [StringTie](https://ccb.jhu.edu/software/stringtie/index.shtml)
-12. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
-13. QC for mapped reads ([`QualiMap: BAM QC`](https://kokonech.github.io/qualimap/HG00096.chr20_bamqc/qualimapReport.html))
-14. Index mapped reads ([samtools index](http://www.htslib.org/))
-15. Collect metrics ([`picard CollectRnaSeqMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard-) and ([`picard MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-))
+    - QC for mapped reads ([`QualiMap: BAM QC`](https://kokonech.github.io/qualimap/HG00096.chr20_bamqc/qualimapReport.html))
+    - Collect metrics ([`picard CollectRnaSeqMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard-) and [`picard MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-))
+13. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+14. Compress bam files to cram with [samtools view](http://www.htslib.org/)
 
 ## Usage
 
@@ -95,23 +76,36 @@ In rnafusion the full-sized test includes reference building and fusion detectio
 > to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
 > with `-profile test` before running the workflow on actual data.
 
-```console
-nextflow run nf-core/rnafusion --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38 --all -profile <docker/singularity/podman/shifter/charliecloud/institute>
-```
+As the reference building is computationally heavy (> 24h on HPC), it is recommended to test the pipeline with the `-stub` parameter (creation of empty files):
+
+First, build the references:
 
 ```bash
-nextflow run nf-core/rnafusion --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
+nextflow run nf-core/rnafusion \
+   -profile <docker/singularity/.../institute> \
+   -profile test \
+   --outdir <OUTDIR> \
+   --build_references \
+   -stub
 ```
 
-> Note that paths need to be absolute and that runs with conda are not supported.
+Then perform the analysis:
 
 ```bash
 nextflow run nf-core/rnafusion \
    -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
-   --outdir <OUTDIR>
+   -profile test \
+   --outdir <OUTDIR> \
+   -stub
 ```
 
+> **Notes:**
+>
+> - Conda is not currently supported; run with singularity or docker.
+> - Paths need to be absolute.
+> - GRCh38 is the only supported reference.
+> - Single-end reads are to be used as a last resort. Paired-end reads are recommended. FusionCatcher cannot be used with single-end reads shorter than 130 bp.
+
 > **Warning:**
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
 > provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
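
The two stub invocations in the new Usage section differ only by `--build_references`. The sketch below assembles both commands and prints them instead of running them, so it can be tried without Nextflow installed. `rnafusion_stub_cmd`, `ENGINE` and `OUTDIR` are illustrative names, not part of the pipeline. The sketch also passes the profiles as one comma-separated list (`-profile test,docker`), which is the form the Nextflow documentation describes for combining profiles; treating it as equivalent to the two `-profile` lines in the README is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: prints the stub commands rather than executing them.
# ENGINE and OUTDIR are illustrative shell variables, not pipeline parameters.
rnafusion_stub_cmd() {
    local step="$1"                          # "build" or "run"
    local engine="${ENGINE:-docker}"         # docker or singularity (conda is not supported)
    local outdir="${OUTDIR:-$PWD/results}"   # the README asks for absolute paths
    local cmd=(nextflow run nf-core/rnafusion -profile "test,${engine}" --outdir "$outdir")
    if [ "$step" = "build" ]; then
        cmd+=(--build_references)            # reference building is a separate first step
    fi
    cmd+=(-stub)                             # stub mode: processes create empty files
    printf '%s\n' "${cmd[*]}"
}

rnafusion_stub_cmd build
rnafusion_stub_cmd run
```

Printing the command first makes it easy to confirm the profile and output directory before committing to a real reference build, which the README estimates at more than 24 hours on HPC.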

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,7 +1,7 @@
 report_comment: >
-  This report has been generated by the <a href="https://github.com/nf-core/rnafusion/2.4.0dev" target="_blank">nf-core/rnafusion</a>
+  This report has been generated by the <a href="https://github.com/nf-core/rnafusion/3.0.0dev" target="_blank">nf-core/rnafusion</a>
   analysis pipeline. For information about how to interpret these results, please see the
-  <a href="https://nf-co.re/rnafusion/2.4.0dev/output" target="_blank">documentation</a>.
+  <a href="https://nf-co.re/rnafusion/3.0.0dev/output" target="_blank">documentation</a>.
 report_section_order:
   "nf-core-rnafusion-methods-description":
     order: -1000

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv