merge branch develop

suhrig · Jan 4, 2020 · 1995f7b
2 parents 477f92b + e437c63
commit 1995f7b
Showing 143 changed files with 230 additions and 65,234 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "htslib"]
+	path = htslib
+	url = https://github.com/samtools/htslib.git
diff --git a/Makefile b/Makefile
@@ -1,7 +1,7 @@
 # input directories
-HTSLIB := htslib-1.8
+HTSLIB := htslib
 SOURCE := source
-STATIC_LIBS := static_libs_centos6.9
+STATIC_LIBS := static_libs_centos6.10
 
 # compiler flags
 CXX := g++
@@ -31,7 +31,7 @@ clean:
 	$(MAKE) -C $(HTSLIB) clean
 
 release:
-	$(MAKE) LIBS_SO="" LIBS_A="$(LIBS_A) $(wildcard $(STATIC_LIBS)/*.a)" CPPFLAGS="-DHAVE_LIBDEFLATE $(CPPFLAGS) -I../$(STATIC_LIBS)"
+	$(MAKE) LIBS_SO="" LIBS_A="$(LIBS_A) $(wildcard $(STATIC_LIBS)/*.a)" CPPFLAGS="-DHAVE_LIBDEFLATE $(CPPFLAGS) -I$(STATIC_LIBS) -I../$(STATIC_LIBS)"
 
 bioconda:
 	$(MAKE) LIBS_SO="-ldl -lhts -ldeflate $(LIBS_SO)" LIBS_A="" CPPFLAGS="-DHAVE_LIBDEFLATE $(CPPFLAGS)" LDFLAGS="$(LDFLAGS)"

diff --git a/documentation/command-line-options.md b/documentation/command-line-options.md
@@ -22,7 +22,7 @@ arriba [-c Chimeric.out.sam] -x Aligned.out.sam \
 : GTF file with gene annotation. The file may be gzip-compressed.
 
 `-G GTF_FEATURES`
-: Comma-/space-separated list of names of GTF features. The names of features in GTF files are not standardized. Different publishers use different names for the same features. For example, GENCODE uses `gene_type` for the gene type feature, whereas ENSEMBL uses `gene_biotype`. In order that Arriba can parse the GTF files from various publishers, the names of GTF features are configurable. Alternative names for one and the same feature can be specified by using the pipe symbol as a separator (`|`). Arriba supports a set of names which is suitable for RefSeq, GENCODE, and ENSEMBL. Default: `gene_name=gene_name gene_id=gene_id transcript_id=transcript_id gene_status=gene_status|gene_type|gene_biotype status_KNOWN=KNOWN|protein_coding gene_type=gene_type|gene_biotype type_protein_coding=protein_coding feature_exon=exon feature_UTR=UTR feature_gene=gene`
+: Comma-/space-separated list of names of GTF features. The names of features in GTF files are not standardized. Different publishers use different names for the same features. For example, GENCODE uses `gene_type` for the gene type feature, whereas ENSEMBL uses `gene_biotype`. In order that Arriba can parse the GTF files from various publishers, the names of GTF features are configurable. Alternative names for one and the same feature can be specified by using the pipe symbol as a separator (`|`). Arriba supports a set of names which is suitable for RefSeq, GENCODE, and ENSEMBL. Default: `gene_name=gene_name|gene_id gene_id=gene_id transcript_id=transcript_id feature_exon=exon feature_CDS=CDS`
 
 `-a FILE`
 : FastA file with genome sequence (assembly). The file may be gzip-compressed. An index with the file extension `.fai` must exist only if CRAM data is processed.
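
Reviewer note on `-G GTF_FEATURES`: the value format described in this hunk (entries separated by commas or spaces, each of the form `key=name`, with alternative names joined by `|`) could be parsed roughly as in the following Python sketch. It only illustrates the documented syntax and is not Arriba's actual parser (Arriba is written in C++); the function name `parse_gtf_features` and the error handling are assumptions.

```python
import re
from typing import Dict, List

# New default value of -G as documented in this commit.
DEFAULT_GTF_FEATURES = (
    "gene_name=gene_name|gene_id gene_id=gene_id "
    "transcript_id=transcript_id feature_exon=exon feature_CDS=CDS"
)


def parse_gtf_features(spec: str) -> Dict[str, List[str]]:
    """Map each feature key to its list of accepted names.

    Hypothetical helper: entries are split on commas and/or whitespace,
    and alternative names within an entry are split on the pipe symbol.
    """
    features: Dict[str, List[str]] = {}
    for entry in re.split(r"[,\s]+", spec.strip()):
        if not entry:
            continue  # empty specification or stray separator
        key, sep, alternatives = entry.partition("=")
        if not sep or not key or not alternatives:
            raise ValueError(f"malformed GTF feature entry: {entry!r}")
        features[key] = alternatives.split("|")
    return features
```

With the new default, `gene_name` accepts two alternative names (`gene_name` and `gene_id`), whereas every other key maps to a single name.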

diff --git a/documentation/draw-fusions-example.png b/documentation/draw-fusions-example.png
diff --git a/documentation/index.md b/documentation/index.md
@@ -10,7 +10,7 @@ Arriba is the winner of the [DREAM SMC-RNA Challenge](https://www.synapse.org/SM
 License
 -------
 
-Apart from the script `draw_fusions.R` all software/code of Arriba is disributed under the MIT/Expat License. The script `draw_fusions.R` is distributed under the GNU GPL v3 due to dependencies on GPL-licensed R packages. The terms and conditions of both licenses can be found in the [LICENSE file](https://raw.githubusercontent.com/suhrig/arriba/master/LICENSE).
+Apart from the script `draw_fusions.R` all software/code of Arriba is distributed under the MIT/Expat License. The script `draw_fusions.R` is distributed under the GNU GPL v3 due to dependencies on GPL-licensed R packages. The terms and conditions of both licenses can be found in the [LICENSE file](https://raw.githubusercontent.com/suhrig/arriba/master/LICENSE).
 
 Citing
 ------

diff --git a/documentation/input-files.md b/documentation/input-files.md
@@ -84,19 +84,29 @@ The file has two columns separated by a tab. Each line lists a pair of genes. Th
 Structural variant calls from WGS
 ---------------------------------
 
-If whole-genome sequencing (WGS) data is available, the sensitivity and specificity of Arriba can be improved by passing a list of structural variants detected from WGS to Arriba:
+If whole-genome sequencing (WGS) data is available, the sensitivity and specificity of Arriba can be improved by passing a list of structural variants detected from WGS to Arriba via the parameter `-d`. This has the following effects:
 
 - Certain filters are overruled or run with extra sensitive settings, when an event is confirmed by WGS data.
 
 - To reduce the false positive rate, Arriba does not report low-confidence events unless they can be matched with a structural variant found in the WGS data.
 
-Both of these behaviors can be disabled by disabling the filters `genomic_support` and `no_genomic_support`, respectively. Providing Arriba with a list of structural variant calls then does not influence the calls, but it still has the benefit of filling the columns `closest_genomic_breakpoint1` and `closest_genomic_breakpoint2` with the breakpoints of the structural variant which is closest to a fusion.
+Both of these behaviors can be disabled by disabling the filters `genomic_support` and `no_genomic_support`, respectively. Providing Arriba with a list of structural variant calls then does not influence the calls, but it still has the benefit of filling the columns `closest_genomic_breakpoint1` and `closest_genomic_breakpoint2` with the breakpoints of the structural variant which is closest to a fusion. If the structural variant calls were obtained from whole-exome sequencing (WES) data rather than WGS data, the filter `no_genomic_support` should be disabled, since WES has poor coverage in most regions of the genome, such that many structural variants are missed.
 
 The file must contain four columns separated by tabs. The first two columns contain the breakpoints of the structural variants in the format `CONTIG:POSITION`. The last two columns contain the orientation of the breakpoints. The accepted values are:
 
 - `downstream` or `+`: the fusion partner is fused downstream of the breakpoint, i.e., at a coordinate higher than the breakpoint
 
 - `upstream` or `-`: the fusion partner is fused at a coordinate lower than the breakpoint
 
+Example:
+
+```
+1:54420491	6:9248349	+	-
+20:46703288	20:46734546	-	+
+17:61499820	20:45133874	+	+
+3:190967119	7:77868317	-	-
+```
+
 Arriba checks if the orientation of the structural variant matches that of a fusion detected in the RNA-Seq data. If, for example, Arriba predicts the 5' end of a gene to be retained in a fusion, then a structural variant is expected to confirm this, or else the variant is not considered to be related.
 
+Note: Arriba was designed for alignments from RNA-Seq data. It should not be run on WGS data directly. Many assumptions made by Arriba about the data (statistical models, blacklist, etc.) only apply to RNA-Seq data and are not valid for DNA-Seq data. For such data, a structural variant calling algorithm should be used and the results should be passed to Arriba.
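
Reviewer note on the structural variant file: the four-column format documented in this hunk can be validated before it is passed to `-d`. The Python sketch below is not part of Arriba; the names `Breakpoint`, `parse_breakpoint`, and `parse_structural_variants` are hypothetical. It assumes that the third column gives the orientation of the first breakpoint and the fourth column that of the second, and that blank lines may be skipped.

```python
from typing import List, NamedTuple, Tuple


class Breakpoint(NamedTuple):
    contig: str
    position: int
    direction: str  # normalized to "downstream" or "upstream"


# Accepted orientation values as listed in the documentation.
_DIRECTIONS = {
    "+": "downstream",
    "downstream": "downstream",
    "-": "upstream",
    "upstream": "upstream",
}


def parse_breakpoint(location: str, orientation: str) -> Breakpoint:
    """Parse one CONTIG:POSITION field together with its orientation."""
    # Split on the last colon so that contig names containing colons survive
    # (an assumption; the documentation does not discuss such names).
    contig, sep, position = location.rpartition(":")
    if not sep or not contig or not position.isdigit():
        raise ValueError(f"expected CONTIG:POSITION, got {location!r}")
    if orientation not in _DIRECTIONS:
        raise ValueError(f"unknown orientation {orientation!r}")
    return Breakpoint(contig, int(position), _DIRECTIONS[orientation])


def parse_structural_variants(text: str) -> List[Tuple[Breakpoint, Breakpoint]]:
    """Parse tab-separated structural variant calls into breakpoint pairs."""
    variants = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(fields)}"
            )
        variants.append(
            (
                parse_breakpoint(fields[0], fields[2]),
                parse_breakpoint(fields[1], fields[3]),
            )
        )
    return variants
```

Normalizing `+`/`-` to `downstream`/`upstream` keeps later comparisons simple, since the documentation treats the two spellings as equivalent.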
diff --git a/documentation/internal-algorithm.md b/documentation/internal-algorithm.md
@@ -41,7 +41,7 @@ The rationale for this filter is the same as for the filter `same_gene`: These a
 : This filter removes a fragment, when both of its mates align to the same gene in an orientation that could arise from canonical splicing. Potentially, these alignments could indicate small intragenic deletions, but more likely they arise from splicing and should be ignored, hence.
 
 `hairpin`
-: A large fraction of the candidates found by STAR are events with a distance between the breakpoints that is smaller than the fragment size. Presumably, these are artifacts introduced during extraction, library preparation, or sequencing that arise from molecules folding back on themselves. These hairpin structures might lead to spontaneous ligations within the molecule or serve as primers for polymerases and facilitate [template switching during PCR or reverse transcription](https://doi.org/10.1371/journal.pone.0012271). When a strand-specific library has been used, a lot of small duplication events are produced; unstranded libraries produce predominantly small inversions. In order to filter these probable false positives, Arriba removes fragments with a transcriptomic distance (i.e., ignoring introns) of less than the mean fragment size plus three standard deviations. The fragment size is estimated automatically or - when single-end data is supplied - a size of 200 nt is assumed. The fragment size can be overwritten via the parameter `-F`.
+: A large fraction of the candidates found by STAR are events with a distance between the breakpoints that is smaller than the fragment size. Presumably, these are artifacts introduced during extraction, library preparation, or sequencing that arise from molecules folding back on themselves. These hairpin structures might lead to spontaneous ligations within the molecule or serve as primers for polymerases and facilitate [template switching during PCR or reverse transcription](https://doi.org/10.1371/journal.pone.0012271). When a strand-specific library is used, a lot of small duplication events are produced; unstranded libraries produce predominantly small inversions. In order to filter these probable false positives, Arriba removes fragments with a transcriptomic distance (i.e., ignoring introns) of less than the mean fragment size plus three standard deviations. The fragment size is estimated automatically or - when single-end data is supplied - a size of 200 nt is assumed. The fragment size can be overwritten via the parameter `-F`.
 
 `mismatches`
 : This filter discards alignments with a high number of reference mismatches relative to the length of the aligned segment. A binomial model is employed to determine statistical significance. The sequencing error rate is assumed to be 1%. The significance cut-off can be adjusted via the parameter `-V` (default 1%).
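
Reviewer note on the `mismatches` filter: the description above (a binomial model, an assumed sequencing error rate of 1%, and a significance cut-off of 1% set via `-V`) can be illustrated with a short Python sketch. This is not Arriba's implementation. It assumes a one-sided test on the upper tail, i.e. the probability of seeing at least the observed number of mismatches by sequencing error alone, and the function names are hypothetical.

```python
from math import comb


def mismatch_p_value(mismatches: int, aligned_length: int,
                     error_rate: float = 0.01) -> float:
    """P(X >= mismatches) for X ~ Binomial(aligned_length, error_rate)."""
    # Sum the probabilities of observing fewer mismatches than seen,
    # then take the complement to obtain the upper tail.
    below = sum(
        comb(aligned_length, i)
        * error_rate ** i
        * (1.0 - error_rate) ** (aligned_length - i)
        for i in range(mismatches)
    )
    return max(0.0, 1.0 - below)  # guard against rounding below zero


def has_too_many_mismatches(mismatches: int, aligned_length: int,
                            cutoff: float = 0.01,
                            error_rate: float = 0.01) -> bool:
    """True if the mismatch count is unlikely under the error model."""
    return mismatch_p_value(mismatches, aligned_length, error_rate) < cutoff
```

Under these assumptions, a 100 nt aligned segment with four mismatches would still pass at the default cut-off, while five mismatches would be flagged.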
@@ -59,7 +59,7 @@ Event-level filters
 : Genes which are not well studied suffer from incomplete annotation. Many exons are annotated as separate genes even though they might actually be part of one and the same gene. Predicted genes named `RP11-...` are common examples for this. When poorly understood genes lie next to each other on the same strand, this would frequently lead to false positive predictions of deletions, because the transcripts that span both genes give rise to reads, which resemble focal deletions. This filter discards deletions, which are predicted between two neighboring genes, if both genes are non-coding or one breakpoint is intergenic.
 
 `intragenic_exonic`
-: Since exons usually make up only a small fraction of a gene, it is more likely that a genomic rearrangement starts and ends in intronic regions. On the transcriptomic level, this manifests as breakpoints at splice-sites or in introns. Many candidates found by STAR have both breakpoints within exons of the same gene. This is particularly true for intragenic events, which are prone to PCR-mediated artifacts. This filter removes intragenic events, if both breakpoints are in exons and more than 80% of the region between the breakpoints is intronic, such that it should be very unlikely that both breakpoints are located inside exons (see parameter `-e`).
+: Since exons usually make up only a small fraction of a gene, it is more likely that a genomic rearrangement starts and ends in intronic regions. On the transcriptomic level, this manifests as breakpoints at splice-sites or in introns. Many candidates found by STAR have both breakpoints within exons of the same gene. This is particularly true for intragenic events, which are prone to in vitro artifacts. This filter removes intragenic events, if both breakpoints are in exons and more than 80% of the region between the breakpoints is intronic, such that it should be very unlikely that both breakpoints are located inside exons (see parameter `-e`).
 
 `min_support`
 : This filter discards all events with fewer reads than specified by the parameter `-S` (default 2).
@@ -74,7 +74,7 @@ Event-level filters
 : When a list of highly recurrent fusions is supplied (see parameter `-k`), this filter recovers events which were discarded because of too few supporting reads, as long as there is no other indication that the event might be an artifact.
 
 `pcr_fusions`
-: In some tissues certain genes are expressed at very high levels, for example hemoglobin and fibrinogen in blood or collagens in connective tissue. Presumably, the abundance of fragments from such genes increases the chance of unrelated molecules sticking together during PCR and serving as primers, which generates a large amount of chimeric fragments in vitro. Such PCR-mediated fusions can be recognized as an extraordinary number of events with breakpoints within exons (rather than at exon boundaries, which is more common for true predictions). This filter eliminates events with genes that are highly expressed (top 0.2%) and have an unbalanced number of split-reads vs. discordant mates or that have an excessive amount of intra-exonic breakpoints.
+: In some tissues certain genes are expressed at very high levels, for example hemoglobin and fibrinogen in blood or collagens in connective tissue. Presumably, the abundance of fragments from such genes increases the chance of unrelated molecules sticking together, which can serve as undesired primers for PCR or [may cause the reverse transcriptase enzyme to switch templates](https://doi.org/10.1371/journal.pone.0012271). These processes generate a large amount of chimeric fragments in vitro. Such artifactual fusions can be recognized as an extraordinary number of events with breakpoints within exons (rather than at exon boundaries, which is more common for true predictions). This filter eliminates events with genes that are highly expressed (top 0.2%) and have an unbalanced number of split-reads vs. discordant mates or that have an excessive amount of intra-exonic breakpoints.
 
 `spliced`
 : This filter recovers events discarded due to a low number of supporting reads, given that both breakpoints of the event are at splice-sites and there is at least one additional event linking the same pair of genes.