Pipeline version 4.0
Summary
Many minor changes to all somatic algorithms, plus the addition of the GRIDSS structural variant caller.
Removal of the KG pipeline and of tumor GATK calling.
Various resources and JARs used by the pipeline can be found at https://resources.hartwigmedicalfoundation.nl.
Improvements to somatic SNV / Indel calling
- To improve sensitivity, variants at known pathogenic locations are retained all the way through Strelka post-processing if they are called by the initial (raw) Strelka caller. The list used by HMF can be found on the resources page and is based on CIViC, CGI and OncoKB, with a few promoter positions in the TERT gene appended.
- Post-Strelka, variants are annotated with a mapping probability based on the known mappability of positions in the reference genome.
- Switched from Germline PON v1.1 to Germline PON v2.0
- Added a Somatic PON which filters out specific Strelka artefacts.
- Added MNV merging. Variants that potentially affect the same codon(s) are checked for phasing and merged if they are phased. This is done within the Strelka Post Process JAR.
- COSMIC annotation has been adjusted so that the COSMIC ID for every transcript affected by a variant is included, not just a single, arbitrarily chosen COSMIC ID. The INFO field provides the information needed to pick the COSMIC ID for a specific transcript.
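The MNV merging step described above can be illustrated with a minimal sketch. This is not the Strelka Post Process implementation: the `Snv` class, the `merge_if_phased` helper, the "at most two bases apart" proxy for "potentially the same codon", and the "identical supporting reads" phasing rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Snv:
    pos: int          # 1-based position on a single chromosome
    ref: str          # reference base
    alt: str          # alternate base
    reads: frozenset  # hypothetical input: names of reads supporting the alt allele


def merge_if_phased(a: Snv, b: Snv, ref_bases: Mapping[int, str]) -> Optional[Snv]:
    """Merge two SNVs (a before b) into one MNV, or return None if they stay separate."""
    # Assumed proxy for "could affect the same codon": at most two bases apart.
    if not 0 < b.pos - a.pos <= 2:
        return None
    # Assumed phasing rule: exactly the same reads carry both alt alleles.
    if not a.reads or a.reads != b.reads:
        return None
    # Fill any unchanged reference base that sits between the two SNVs.
    gap = "".join(ref_bases[p] for p in range(a.pos + 1, b.pos))
    return Snv(a.pos, a.ref + gap + b.ref, a.alt + gap + b.alt, a.reads)


first = Snv(100, "A", "T", frozenset({"r1", "r2"}))
second = Snv(102, "G", "C", frozenset({"r1", "r2"}))
merged = merge_if_phased(first, second, {101: "C"})
print(merged.pos, merged.ref, merged.alt)  # 100 ACG TCC
```

A production caller would typically decide phasing from read-level evidence with some tolerance for sequencing errors, rather than requiring identical read sets.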
Added GRIDSS as an additional somatic structural variant caller
- GRIDSS is implemented alongside Manta/BPI. Our intention is to eventually replace Manta/BPI with GRIDSS, since we expect it to perform better across our cohort of samples. All documentation on GRIDSS can be found at https://github.com/PapenfussLab/gridss.
Other changes
- Germline calling is now only performed on the reference sample and hence the germline VCF contains the calls for just one sample.
- Every final VCF (germline, somatic, SV, etc.) is gzipped, and a tabix index is provided along with the gzipped VCF.
- The kinship test to detect sample swaps is replaced by a test based on BAF scores. The main reason is that kinship penalises het-to-hom transitions, which occur in proportion to the degree of LOH. Using BAFs, we can detect sample swaps by observing a mean BAF that deviates significantly from 0.5, which is independent of the degree of LOH in the tumor.
- The QC checks are now run as part of the pipeline; previously they were a post-pipeline step.
- KG configuration is no longer supported, but there is an INI to analyse just a single sample. This INI runs the algorithms that would normally be run on the reference sample of a somatic pair of samples.
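The BAF-based swap test mentioned above can be sketched as follows. This is an illustration only: the function name, the input format (alt-read count and depth per site), the assumption that BAF is measured in the tumor at sites that are heterozygous in the reference sample, and the `tolerance` value are all hypothetical and not taken from the pipeline.

```python
from statistics import mean


def mean_baf_swap_check(site_counts, tolerance=0.05):
    """Return (is_suspect, mean_baf) for a list of (alt_reads, depth) pairs.

    `tolerance` is an illustrative threshold, not the value used by the pipeline.
    """
    bafs = [alt / depth for alt, depth in site_counts if depth > 0]
    if not bafs:
        raise ValueError("no covered sites to evaluate")
    m = mean(bafs)
    # A matched tumor keeps a mean BAF near 0.5 even under LOH, because
    # individual sites drift towards 0 or 1 roughly symmetrically.
    return abs(m - 0.5) > tolerance, m


# Matched tumor with heavy LOH: sites pushed to both extremes, mean stays near 0.5.
matched = [(5, 50), (45, 50), (25, 50), (24, 50)]
# Unrelated sample: most sites are homozygous reference, so the mean drops.
swapped = [(0, 50), (0, 40), (50, 50), (0, 30)]

print(mean_baf_swap_check(matched)[0])  # False
print(mean_baf_swap_check(swapped)[0])  # True
```

The design point this illustrates is the one made in the text: unlike a kinship score, the mean is insensitive to how many sites have lost heterozygosity, as long as the losses are balanced between the two alleles.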
New tool versions
- GRIDSS introduced at v1.8.0 (using bwa v0.7.17)
Version changes
- Purple v1.2 to v2.14
- Cobalt v1.0 to v1.4
- Amber v1.0 to v1.5
- BPI v1.2 to v1.6
- Strelka Post Process v1.0 to v1.4
- HealthChecker v2.1 to v2.4
- GATK v3.4.46 to v3.8
- snpEff v4.1h to v4.3s
Quality
Since we no longer have a KG pipeline, we no longer report germline precision and sensitivity.
The somatic precision and sensitivity for SNVs and indels are determined on an internally sequenced GIAB mix of 70% NA24385 and 30% NA12878, against 100% NA24385 as the reference sample. Results are as follows:
Somatic precision & sensitivity
Type | Algorithm | TP | FP | FN | Precision | Sensitivity | Δ Precision | Δ Sensitivity |
---|---|---|---|---|---|---|---|---|
INDEL | Strelka | 74360 | 641 | 22412 | 99.1% | 76.8% | -0.1% | -0.2% |
SNV | Strelka | 955590 | 1253 | 38084 | 99.9% | 96.2% | 0% | 0% |
MNV | Strelka | 6868 | 21 | 0 | 99.7% | 100.0% | - | - |
- Note: The differences relative to v3 are entirely attributable to changes in the way we measure the above numbers. Applying the same measurement method to v3 and v4 yields no differences, which is as expected since we made no changes that significantly affect either sensitivity or precision.
In addition, to measure the exact false positive rate, we analyse a sample against itself at roughly 30x/100x coverage. With the pipeline v4.0 release we find 136 false positives in total across the whole genome (109 SNVs and 27 INDELs).
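For reference, the precision and sensitivity percentages in the table above follow from the TP, FP and FN counts using the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The short sketch below recomputes them; the function names and the `counts` dictionary are only illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are true positives."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that were called."""
    return tp / (tp + fn)


# (TP, FP, FN) per variant type, copied from the table above.
counts = {
    "INDEL": (74360, 641, 22412),
    "SNV": (955590, 1253, 38084),
    "MNV": (6868, 21, 0),
}

for variant_type, (tp, fp, fn) in counts.items():
    print(f"{variant_type}: precision {precision(tp, fp):.1%}, "
          f"sensitivity {sensitivity(tp, fn):.1%}")
# INDEL: precision 99.1%, sensitivity 76.8%
# SNV: precision 99.9%, sensitivity 96.2%
# MNV: precision 99.7%, sensitivity 100.0%
```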