Merge pull request #396 from nf-core/update_documentation

update intro and usage documentation
nf-core · Sep 13, 2023 · commit 2bfdb59
2 parents 17074d8 + 3395388

Showing 10 changed files with 449 additions and 385 deletions.
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,34 +10,57 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [Arriba](https://github.com/suhrig/arriba)
 
-  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
+  > Uhrig S, Ellermann J, Walther T, Burkhardt P, Fröhlich M, Hutter B, Toprak UH, Neumann O, Stenzinger A, Scholl C, Fröhling S, Brors B. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Research. 2021 Mar 31;448-460. doi: 10.1101/gr.257246.119. Epub 2021 Jan 13. PubMed PMID: 33441414.
 
-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+- [BEDOPS](https://bedops.readthedocs.io/en/latest/index.html) - convert2bed
 
-  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+  > Neph S, Scott Kuehn M, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, Sandstrom R, Humbert R, Stamatoyannopoulos JA. BEDOPS: high-performance genomic feature operations. Bioinformatics. 2012 May, 28 (14): 1919-1920. doi: 10.1093/bioinformatics/bts277, PubMed PMID: 22576172.
 
-- [Arriba](https://github.com/suhrig/arriba)
+- [FastP](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234)
 
-  > Uhrig S, Ellermann J, Walther T, Burkhardt P, Fröhlich M, Hutter B, Toprak UH, Neumann O, Stenzinger A, Scholl C, Fröhling S, Brors B. Accurate and efficient detection of gene fusions from RNA sequencing data.
-  > Genome Research. 2021 Mar 31;448-460. doi: 10.1101/gr.257246.119. Epub 2021 Jan 13. PubMed PMID: 33441414; PubMed Central PMCID: PMC7919457.
+  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sept 34:17 (i884–i890), doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086. PubMed Central PMCID: PMC6129281.
+
+- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+
+  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
 
 - [FusionCatcher](https://github.com/ndaniel/fusioncatcher)
 
   > Nicorici D, Satalan M, Edgren H, Kangaspeska S, Murumagi A, Kallioniemi O, Virtanen S, Kilkku O. FusionCatcher – a tool for finding somatic fusion genes in paired-end RNA-sequencing data. BioRxiv, 2014 Nov. doi: 10.1101/011650.
 
+- [FusionInspector](https://github.com/FusionInspector/FusionInspector)
+
+  > Haas BJ, Dobin A, Ghandi M, Van Arsdale A, Tickle T, Robinson JT, Gillani R, Kasif S, Regev A. Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector. Cell Reports Methods. 2023 May 3:5, doi: 10.1016/j.crmeth.2023.100467, PMID: 37323575
+
 - [Fusion-report](https://github.com/matq007/fusion-report)
 
   > Proks M, Genomic Profiling of a Comprehensive Nation-wide Collection of Childhood Solid Tumors, Master Thesis, Supervisors: Grøntved L, Díaz de Ståhl T, Nistér M, Ewels P, Garcia MU, Juhos S, University of Southern Denmark, 2019, unpublished.
 
+- [GATK4](https://gatk.broadinstitute.org/hc/en-us)
+
+  > Van der Auwera GA. Somatic variation discovery with GATK4. Proceedings of the American Association for Cancer Research Annual Meeting 2017. 2017 Apr 1-5. Cancer Res 2017;77(13 Suppl) doi:10.1158/1538-7445.AM2017-3590
+
 - [Kallisto](https://pachterlab.github.io/kallisto/)
 
   > Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 2016 Apr. 34, 525–527. doi:10.1038/nbt.3519. PMID: 27043002.
 
+- [MegaFusion](https://github.com/J35P312/MegaFusion)
+
+- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+
+  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+
+- [picard-tools](http://broadinstitute.github.io/picard)
+
 - [Pizzly](https://github.com/pmelsted/pizzly)
   Melsted P, Hateley S, Joseph IC, Pimentel H, Bray N, Pachter L. Fusion detection and quantification by pseudoalignment. BioRxiv, 2017 Jul. doi: 10.1101/166322.
 
+- [Qualimap 2](https://pubmed.ncbi.nlm.nih.gov/26428292/)
+
+  > Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016 Jan 15;32(2):292-4. doi: 10.1093/bioinformatics/btv566. Epub 2015 Oct 1. PubMed PMID: 26428292; PubMed Central PMCID: PMC4708105.
+
 - [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)
 
   > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

diff --git a/README.md b/README.md
@@ -12,81 +12,62 @@
 
 ## Introduction
 
-**nf-core/rnafusion** is a bioinformatics best-practice analysis pipeline for RNA sequencing analysis pipeline with curated list of tools for detecting and visualizing fusion genes.
+**nf-core/rnafusion** is a bioinformatics best-practice analysis pipeline for RNA sequencing consisting of several tools designed for detecting and visualizing fusion genes. Results from up to five fusion-calling tools are produced and then aggregated, most notably into a PDF visualisation document, a VCF data collection file, and HTML and TSV reports.
 
-The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
-
-> **IMPORTANT: conda is not supported currently.** Run with singularity or docker.
-
-> GRCh38 is the only supported reference
-
-| Tool                                                      | Version  |
-| --------------------------------------------------------- | :------: |
-| [Arriba](https://github.com/suhrig/arriba)                | `2.3.0`  |
-| [FusionCatcher](https://github.com/ndaniel/fusioncatcher) |  `1.33`  |
-| [Pizzly](https://github.com/pmelsted/pizzly)              | `0.37.3` |
-| [Squid](https://github.com/Kingsford-Group/squid)         |  `1.5`   |
-| [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) | `1.10.1` |
-| [StringTie](https://github.com/gpertea/stringtie)         | `2.2.1`  |
-
-> Single-end reads are to be use as last-resort. Paired-end reads are recommended. FusionCatcher cannot be used with single-end reads shorter than 130 bp.
-
-On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/rnafusion/results).
+On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/rnafusion/results).
 
 In rnafusion the full-sized test includes reference building and fusion detection. The test dataset is taken from [here](https://github.com/nf-core/test-datasets/tree/rnafusion/testdata/human).
 
 ## Pipeline summary
 
 ![nf-core/rnafusion metro map](docs/images/nf-core-rnafusion_metro_map.png)
 
-#### Build references
+### Build references
 
-`--build_references` triggers a parallel workflow to build all references
+`--build_references` triggers a parallel workflow to build references, which is a prerequisite to running the pipeline:
 
 1. Download ensembl fasta and gtf files
-2. Create STAR index
-3. Download arriba references
-4. Download fusioncatcher references
-5. Download pizzly references (kallisto index)
-6. Download and build STAR-fusion references
-7. Download fusion-report DBs
+2. Create [STAR](https://github.com/alexdobin/STAR) index
+3. Download [Arriba](https://github.com/suhrig/arriba) references
+4. Download [FusionCatcher](https://github.com/ndaniel/fusioncatcher) references
+5. Download [Pizzly](https://github.com/pmelsted/pizzly) references ([kallisto](https://pachterlab.github.io/kallisto/manual) index)
+6. Download and build [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) references
+7. Download [Fusion-report](https://github.com/Clinical-Genomics/fusion-report) DBs
 
 #### Main workflow
 
 1. Input samplesheet check
-2. Concatenate fastq files per sample
-3. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-4. Arriba subworkflow
+2. Concatenate fastq files per sample ([cat](http://www.linfo.org/cat.html))
+3. Reads quality control ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
+4. Optional trimming with [fastp](https://github.com/OpenGene/fastp)
+5. Arriba subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
-   - [Samtool](https://github.com/samtools/samtools) sort
-   - [Samtool](https://github.com/samtools/samtools) index
    - [Arriba](https://github.com/suhrig/arriba) fusion detection
-5. Pizzly subworkflow
+6. Pizzly subworkflow
    - [Kallisto](https://pachterlab.github.io/kallisto/) quantification
    - [Pizzly](https://github.com/pmelsted/pizzly) fusion detection
-6. Squid subworkflow
+7. Squid subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
    - [Samtools view](http://www.htslib.org/): convert sam output from STAR to bam
    - [Samtools sort](http://www.htslib.org/): bam output from STAR
    - [SQUID](https://github.com/Kingsford-Group/squid) fusion detection
    - [SQUID](https://github.com/Kingsford-Group/squid) annotate
-7. STAR-fusion subworkflow
+8. STAR-fusion subworkflow
    - [STAR](https://github.com/alexdobin/STAR) alignment
    - [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) fusion detection
-8. Fusioncatcher subworkflow
+9. Fusioncatcher subworkflow
    - [FusionCatcher](https://github.com/ndaniel/fusioncatcher) fusion detection
-9. Fusion-report subworkflow
-   - Merge all fusions detected by the different tools
-   - [Fusion-report](https://github.com/matq007/fusion-report)
-10. FusionInspector subworkflow
+10. StringTie subworkflow
+    - [StringTie](https://ccb.jhu.edu/software/stringtie/)
+11. Fusion-report
+    - Merge all fusions detected by the selected tools with [Fusion-report](https://github.com/Clinical-Genomics/fusion-report)
+12. Post-processing and analysis of data
     - [FusionInspector](https://github.com/FusionInspector/FusionInspector)
     - [Arriba](https://github.com/suhrig/arriba) visualisation
-11. Stringtie subworkflow
-    - [StringTie](https://ccb.jhu.edu/software/stringtie/index.shtml)
-12. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
-13. QC for mapped reads ([`QualiMap: BAM QC`](https://kokonech.github.io/qualimap/HG00096.chr20_bamqc/qualimapReport.html))
-14. Index mapped reads ([samtools index](http://www.htslib.org/))
-15. Collect metrics ([`picard CollectRnaSeqMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard-) and ([`picard MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-))
+    - QC for mapped reads ([`QualiMap: BAM QC`](https://kokonech.github.io/qualimap/HG00096.chr20_bamqc/qualimapReport.html))
+    - Collect metrics ([`picard CollectRnaSeqMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard-) and [`picard MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-))
+13. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+14. Compress bam files to cram with [samtools view](http://www.htslib.org/)
 
 ## Usage
 
@@ -95,23 +76,36 @@ In rnafusion the full-sized test includes reference building and fusion detectio
 > to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
 > with `-profile test` before running the workflow on actual data.
 
-```console
-nextflow run nf-core/rnafusion --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38 --all -profile <docker/singularity/podman/shifter/charliecloud/institute>
-```
+As reference building is computationally heavy (> 24 h on HPC), it is recommended to first test the pipeline with the `-stub` parameter, which creates empty files.
+
+First, build the references:
 
 ```bash
-nextflow run nf-core/rnafusion --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
+nextflow run nf-core/rnafusion \
+   -profile <docker/singularity/.../institute> \
+   -profile test \
+   --outdir <OUTDIR> \
+   --build_references \
+   -stub
 ```
 
-> Note that paths need to be absolute and that runs with conda are not supported.
+Then perform the analysis:
 
 ```bash
 nextflow run nf-core/rnafusion \
    -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
-   --outdir <OUTDIR>
+   -profile test \
+   --outdir <OUTDIR> \
+   -stub
 ```
 
+> **Notes:**
+>
+> - Conda is not currently supported; run with singularity or docker.
+> - Paths need to be absolute.
+> - GRCh38 is the only supported reference.
+> - Single-end reads should be used only as a last resort; paired-end reads are recommended. FusionCatcher cannot be used with single-end reads shorter than 130 bp.
+
 > **Warning:**
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
 > provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
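
The two-step stub test described in the README hunk above (build the references, then run the analysis) can also be scripted. The following is only a minimal sketch: the function name `stub_commands` and the default container profile are hypothetical, and it assumes the `test` profile and the container profile are passed as one comma-separated `-profile` value, which is the form Nextflow accepts for multiple profiles. Only flags that appear in the README (`-profile`, `--outdir`, `--build_references`, `-stub`) are used.

```python
import shlex


def stub_commands(outdir, container_profile="docker"):
    """Return (build_cmd, run_cmd) argv lists for a stub test of nf-core/rnafusion.

    `outdir` should be an absolute path, as the README notes require.
    `container_profile` is a placeholder for docker, singularity, or an institute profile.
    """
    base = [
        "nextflow", "run", "nf-core/rnafusion",
        "-profile", f"test,{container_profile}",  # assumption: comma-separated profiles
        "--outdir", outdir,
        "-stub",
    ]
    build_cmd = base + ["--build_references"]  # step 1: reference building
    run_cmd = list(base)                       # step 2: analysis
    return build_cmd, run_cmd


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for cmd in stub_commands("/abs/path/to/results"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch free of side effects; the lists could be handed to `subprocess.run` once the paths and profile have been checked.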

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
diff --git a/assets/samplesheet_valid.csv b/assets/samplesheet_valid.csv
diff --git a/bin/get_rrna_transcripts.py b/bin/get_rrna_transcripts.py
@@ -8,7 +8,6 @@
 
 def get_rrna_intervals(file_in, file_out):
     """
-    Get the commented out header
     Get lines containing ``#`` or ``gene_type rRNA`` or ```` or ``gene_type rRNA_pseudogene`` or ``gene_type MT_rRNA``
     Create output file