Initial version of FORTE output documentation

mskcc · Jul 19, 2023 · d0899f0
1 parent 5aeb38e
commit d0899f0
Showing 1 changed file with 149 additions and 23 deletions.
diff --git a/docs/output.md b/docs/output.md
@@ -1,57 +1,183 @@
-# mskcc/forte: Output
+# anoronh4/forte: Output
 
 ## Introduction
 
-This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
+This document describes the output produced by the FORTE pipeline.
 
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
-
-<!-- TODO nf-core: Write this documentation describing your workflow's output -->
+The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
 
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [FastQC](#fastqc) - Raw read QC
-- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
+- [Read Preprocessing](#read-preprocessing)
+- [Alignment](#alignment)
+- [Quantification](#quantification)
+- [Fusion Calling](#fusion-calling)
+- [QC](#qc)
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
-### FastQC
+### Read Preprocessing
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `analysis/<sample>/fastp/`
+  - `*.fastp.html`
+  - `*.fastp.json`
+  - `*.fastp.log`
+  - `*.fastp.fastq.gz`
+- `analysis/<sample>/umitools/extract/`
+  - `logs/.umi_extract.log`
+
+</details>
+
+[fastp](https://github.com/OpenGene/fastp) reports general quality metrics for your sequenced reads and trims them based on base quality and the presence of adapter sequences.
+
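As a rough illustration of how the `*.fastp.json` report can be used downstream, the sketch below computes the fraction of reads that survive filtering. It assumes the `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` keys found in fastp's JSON report; the function name and the numbers are hypothetical and not part of FORTE.

```python
import json


def fastp_read_retention(report: dict) -> float:
    """Return the fraction of reads kept by fastp (0.0 if there were no reads)."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return after / before if before else 0.0


# Tiny stand-in for the contents of a *.fastp.json file (made-up numbers).
example_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
retention = fastp_read_retention(example_report)
print(f"Reads retained after trimming and filtering: {retention:.1%}")
```

For a real sample, the dictionary would come from `json.load()` on the file under `analysis/<sample>/fastp/`.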
+[UMI-tools extract](https://umi-tools.readthedocs.io/en/latest/reference/extract.html) removes UMI sequences from the reads and appends them to the read names. As a result, aligners do not attempt to align the UMI sequence, and the aligned reads are ready for UMI-based deduplication.
+
+### Alignment
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `analysis/<sample>/STAR/`
+  - `*.Aligned.sortedByCoord.out.bam`
+  - `*.Aligned.sortedByCoord.out.bam.bai`
+  - `log/`
+    - `*.Log.out`
+    - `*.Log.final.out`
+    - `*.Log.progress.out`
+    - `*.ReadsPerGene.out.tab`
+    - `*.SJ.out.tab`
+- `analysis/<sample>/umitools/dedup/`
+  - `*.dedup.bam`
+  - `*.dedup.bam.bai`
+  - `logs/`
+    - `*.dedup_edit_distance.tsv`
+    - `*.dedup_per_umi_per_position.tsv`
+    - `*.dedup_per_umi.tsv`
+
+</details>
+
+[STAR](https://github.com/alexdobin/STAR) is an ultrafast universal RNA-seq aligner.
+
+[UMI-tools dedup](https://umi-tools.readthedocs.io/en/latest/reference/dedup.html) deduplicates reads based on the mapping coordinate and the UMI attached to the read.
+
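The idea behind coordinate-plus-UMI deduplication can be sketched as follows. This is a deliberately simplified illustration, not how UMI-tools is implemented: it only collapses reads whose position, strand, and UMI match exactly, whereas UMI-tools can also group UMIs that differ by sequencing errors. The `AlignedRead` type is hypothetical, and the sketch assumes the UMI is the last `_`-separated field of the read name.

```python
from typing import Iterable, List, NamedTuple


class AlignedRead(NamedTuple):
    name: str    # read name with the UMI appended, e.g. "r1_ACGT"
    chrom: str
    pos: int
    strand: str


def umi_of(read_name: str, separator: str = "_") -> str:
    """Extract the UMI, assumed to be the last separator-delimited field."""
    return read_name.rsplit(separator, 1)[-1]


def dedup_exact(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Keep the first read seen for each (chrom, pos, strand, UMI) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, umi_of(read.name))
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    AlignedRead("r1_ACGT", "chr1", 100, "+"),
    AlignedRead("r2_ACGT", "chr1", 100, "+"),  # same position and UMI: duplicate
    AlignedRead("r3_TTGA", "chr1", 100, "+"),  # same position, different UMI
    AlignedRead("r4_ACGT", "chr2", 100, "+"),  # same UMI, different position
]
unique_reads = dedup_exact(reads)
print([read.name for read in unique_reads])
```

Only `r2_ACGT` is dropped here, because it shares both its mapping coordinate and its UMI with `r1_ACGT`.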
+### Quantification
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `fastqc/`
-  - `*_fastqc.html`: FastQC report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+- `analysis/<sample>/htseq/`
+  - `*.htseq.count.txt`
+- `analysis/<sample>/kallisto/`
+  - `abundance.h5`
+  - `abundance.tsv`
+  - `run_info.json`
+  - `*.log.txt`
 
 </details>
 
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+[htseq-count](https://htseq.readthedocs.io/en/master/htseqcount.html) takes a file of aligned sequencing reads plus a list of genomic features, and counts how many reads map to each feature.
 
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
+[Kallisto](http://pachterlab.github.io/kallisto/) quantifies abundances of transcripts from RNA-Seq data using high-throughput sequencing reads.
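A minimal sketch of reading the two quantification outputs is shown here. It assumes the usual layouts of these files: a headerless, tab-separated htseq-count table whose special counters start with `__` (for example `__no_feature`), and a kallisto `abundance.tsv` with `target_id` and `tpm` columns. The function names and the gene and transcript identifiers are hypothetical.

```python
import csv
import io


def read_htseq_counts(handle):
    """Split an htseq-count table into per-gene counts and '__' summary counters."""
    genes, summary = {}, {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        feature, count = fields[0], int(fields[-1])
        target = summary if feature.startswith("__") else genes
        target[feature] = count
    return genes, summary


def read_kallisto_tpm(handle):
    """Map each transcript ID in a kallisto abundance.tsv to its TPM value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


# Made-up file contents standing in for real pipeline output.
htseq_text = "GENE_A\t10\nGENE_B\t0\n__no_feature\t5\n__ambiguous\t1\n"
kallisto_text = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "TX_1\t1000\t850.0\t20\t12.5\n"
    "TX_2\t500\t350.0\t0\t0\n"
)

genes, summary = read_htseq_counts(io.StringIO(htseq_text))
tpm = read_kallisto_tpm(io.StringIO(kallisto_text))
print(genes, summary, tpm)
```

With real output, the `io.StringIO` objects would be replaced by open file handles for `*.htseq.count.txt` and `abundance.tsv`.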
 
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+### Fusion Calling
 
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+<details markdown="1">
+<summary>Output files</summary>
 
-> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
+- `analysis/<sample>/arriba/`
+  - `*.fusions.discarded.tsv`
+  - `*.fusions.tsv`
+- `analysis/<sample>/fusioncatcher/`
+  - `*.fusioncatcher.fusion-genes.hg19.txt`
+  - `*.fusioncatcher.fusion-genes.txt`
+  - `*.fusioncatcher.log`
+  - `*.fusioncatcher.summary.txt`
+- `analysis/<sample>/starfusion/`
+  - `*.starfusion.abridged.coding_effect.tsv`
+  - `*.starfusion.abridged.tsv`
+  - `*.starfusion.fusion_predictions.tsv`
+  - `STAR/`
+    - `*.Chimeric.out.junction`
+    - `log/`
+      - `*.Log.final.out`
+      - `*.Log.out`
+      - `*.Log.progress.out`
+      - `*.SJ.out.tab`
+
+</details>
 
-### MultiQC
+[Arriba](https://arriba.readthedocs.io/en/latest/) uses the STAR aligner to detect gene fusions from RNA-Seq data.
+
+[FusionCatcher](https://github.com/ndaniel/fusioncatcher) searches for novel/known somatic fusion genes, translocations, and chimeras in RNA-seq data.
+
+[STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads.
+
+*More coming soon...*
+
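As one example of working with the fusion calls, the sketch below keeps only the high-confidence rows of an Arriba `*.fusions.tsv` table. It assumes the `#gene1`, `gene2`, and `confidence` columns described in Arriba's output documentation; the example table is heavily trimmed, the fusion partners are illustrative, and the function name is hypothetical.

```python
import csv
import io


def high_confidence_fusions(handle):
    """Return (gene1, gene2) pairs for rows whose confidence is 'high'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["#gene1"], row["gene2"])
        for row in reader
        if row["confidence"] == "high"
    ]


# Trimmed stand-in for a *.fusions.tsv file; real files have many more columns.
arriba_text = (
    "#gene1\tgene2\tconfidence\n"
    "BCR\tABL1\thigh\n"
    "GENE_X\tGENE_Y\tlow\n"
    "EML4\tALK\thigh\n"
)
fusions = high_confidence_fusions(io.StringIO(arriba_text))
print(fusions)
```

Reading the columns by name rather than by position keeps the sketch independent of the exact column order.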
+### QC
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `multiqc/`
-  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
-  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
-  - `multiqc_plots/`: directory containing static images from the report in various formats.
+- `analysis/<sample>/picard/`
+  - `*.rna_metrics`
+  - `*.CollectHsMetrics.coverage_metrics`
+- `analysis/<sample>/rseqc/`
+  - `*.bam_stat.txt`
+  - `*.DupRate_plot.pdf`
+  - `*.DupRate_plot.r`
+  - `*.infer_experiment.txt`
+  - `*.inner_distance_freq.txt`
+  - `*.inner_distance_mean.txt`
+  - `*.inner_distance_plot.pdf`
+  - `*.inner_distance_plot.r`
+  - `*.inner_distance.txt`
+  - `*.junction_annotation.log`
+  - `*.junction.bed`
+  - `*.junction.Interact.bed`
+  - `*.junction_plot.r`
+  - `*.junctionSaturation_plot.pdf`
+  - `*.junctionSaturation_plot.r`
+  - `*.junction.xls`
+  - `*.pos.DupRate.xls`
+  - `*.read_distribution.txt`
+  - `*.seq.DupRate.xls`
+  - `*.splice_events.pdf`
+  - `*.splice_junction.pdf`
+- `analysis/<sample>/multiqc/`
+  - `dedupbam_multiqc_report_data/`
+    - `*.json`
+    - `*.log`
+    - `*.txt`
+  - `dedupbam_multiqc_report.html`
+  - `dedupbam_multiqc_report_plots/`
+    - `pdf/*.pdf`
+    - `png/*.png`
+    - `svg/*.svg`
+  - `dupbam_multiqc_report_data/`
+    - `*.json`
+    - `*.log`
+    - `*.txt`
+  - `dupbam_multiqc_report.html`
+  - `dupbam_multiqc_report_plots/`
+    - `pdf/*.pdf`
+    - `png/*.png`
+    - `svg/*.svg`
+
 
 </details>
 
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+[Picard's CollectHsMetrics](https://gatk.broadinstitute.org/hc/en-us/articles/360036856051-CollectHsMetrics-Picard-) collects hybrid-selection (HS) metrics for a SAM or BAM file. This file is only produced if a baitset is specified in the samplesheet.
+
+[Picard's CollectRnaSeqMetrics](https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard-) produces RNA alignment metrics for a SAM or BAM file.
+
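Both Picard outputs are plain-text metrics files, so they can be read with a few lines of code. The sketch below assumes the standard Picard metrics layout: comment lines starting with `#`, a `## METRICS CLASS` line, then a tab-separated header row followed by one or more value rows and a blank line. The function name and the numbers are hypothetical.

```python
def read_picard_metrics(lines):
    """Return the rows of the metrics table as a list of {column: value} dicts."""
    lines = [line.rstrip("\n") for line in lines]
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            rows = []
            for data_line in lines[index + 2:]:
                if not data_line.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, data_line.split("\t"))))
            return rows
    return []


# Made-up excerpt standing in for a *.rna_metrics file.
example_lines = [
    "## htsjdk.samtools.metrics.StringHeader",
    "# CollectRnaSeqMetrics (command line elided)",
    "",
    "## METRICS CLASS\tpicard.analysis.RnaSeqMetrics",
    "PF_BASES\tPF_ALIGNED_BASES\tPCT_MRNA_BASES",
    "1000\t900\t0.85",
    "",
    "## HISTOGRAM\tjava.lang.Integer",
    "normalized_position\tAll_Reads.normalized_coverage",
    "0\t0.5",
]
metrics = read_picard_metrics(example_lines)
print(metrics)
```

Values are returned as strings so that the caller can decide how to convert each column.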
+[RSeQC](https://rseqc.sourceforge.net/) provides a number of useful modules that can comprehensively evaluate high throughput RNAseq data.
 
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
+[MultiQC](https://multiqc.info/) is a visualization tool that searches a given directory for analysis and QC logs and compiles an HTML report. Most of the pipeline QC results are visualized in the report, and further statistics are available in the report data directory. FORTE produces a second MultiQC report for each sample that has UMIs. FORTE also produces one or two reports under the `multiqc/` folder in which all samples are aggregated together: one for non-deduplicated results and the other for deduplicated results.
 
 ### Pipeline information