This repo provides workflows that take advantage of GATK's CNN tool, a deep learning approach to filtering variants based on convolutional neural networks.
Please read the following discussion to learn more about the CNN tool: Deep Learning in GATK4.
This workflow takes an input CRAM/BAM, calls variants with HaplotypeCaller, then filters the calls with the CNNScoreVariants neural net tool using the specified filtering model. The site-level scores are added to the INFO field of the VCF. The architecture, `info_key`, and `tensor_type` arguments MUST be in agreement (e.g. 2D models must have a `tensor_type` of `read_tensor` and an `info_key` of `CNN_2D`; 1D models have a `tensor_type` of `reference` and an `info_key` of `CNN_1D`). The INFO field key will be `CNN_1D` or `CNN_2D` depending on the neural net architecture used for inference. The architecture arguments specify pre-trained networks. New networks can be trained with the GATK tools CNNVariantWriteTensors and CNNVariantTrain. The CRAM could be generated by the single-sample pipeline.
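The `tensor_type`/`info_key` agreement described above maps directly onto the underlying GATK commands the workflow wraps. The following is a hedged sketch, not the workflow's exact task commands; all file paths are placeholders, and the flags shown are from the standard GATK CNNScoreVariants and FilterVariantTranches interfaces:

```shell
# Score variants with the 2D read-level model: tensor_type read_tensor
# writes the CNN_2D key into the INFO field (all paths are placeholders).
gatk CNNScoreVariants \
    -R reference.fasta \
    -I sample.bam \
    -V raw_calls.vcf.gz \
    --tensor-type read_tensor \
    -O scored.vcf.gz

# Filter on the matching info key. A 1D run would instead use
# --tensor-type reference above (no -I needed) and --info-key CNN_1D here.
gatk FilterVariantTranches \
    -V scored.vcf.gz \
    --resource hapmap.vcf.gz \
    --resource mills.vcf.gz \
    --info-key CNN_2D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4 \
    -O filtered.vcf.gz
```

Note how swapping the model dimension requires changing both the tensor type at scoring time and the info key at filtering time; mismatching them is the most common failure mode the paragraph above warns about.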
If you would like to test the workflow on a more representative example file, use the following CRAM file as input and change the scatter count from 4 to 200: `gs://gatk-best-practices/cnn-h38/NA12878_NA12878_IntraRun_1_SM-G947Y_v1.cram`.
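In practice both changes go in the workflow's inputs JSON. The fragment below is illustrative only: the workflow name `Cram2FilteredVcf` and the input key names are assumptions, so check the repo's example inputs JSON for the exact names before use.

```json
{
  "Cram2FilteredVcf.input_cram": "gs://gatk-best-practices/cnn-h38/NA12878_NA12878_IntraRun_1_SM-G947Y_v1.cram",
  "Cram2FilteredVcf.scatter_count": 200
}
```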
Requirements/expectations:
- CRAM/BAM
- BAM index (if input is BAM)

Outputs:
- Filtered VCF and its index
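One way to launch the workflow is with a local Cromwell run; this is a minimal sketch in which the Cromwell jar path, WDL filename, and inputs filename are placeholders for whatever your checkout uses:

```shell
# Run the filtering workflow locally with Cromwell
# (jar, WDL, and inputs file names are placeholders).
java -jar cromwell.jar run cram2filtered.wdl --inputs cram2filtered.json
```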
This optional workflow is for advanced users who would like to train a CNN model for filtering variants.
Requirements/expectations:
- CRAM
- Truth VCF and its index
- Truth confidence interval BED

Outputs:
- Model HD5
- Model JSON
- Model plots PNG
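Training with these tools follows a write-tensors-then-train pattern. The sketch below uses placeholder paths and should be checked against the CNNVariantWriteTensors and CNNVariantTrain documentation for your GATK version, as flag spellings have varied across releases:

```shell
# Convert variant calls plus truth labels into training tensors
# (placeholder paths; verify flags for your GATK version).
gatk CNNVariantWriteTensors \
    -R reference.fasta \
    -V calls.vcf.gz \
    --truth-vcf truth.vcf.gz \
    --truth-bed confident_regions.bed \
    --tensor-type reference \
    --output-tensor-dir tensors/

# Train a new 1D model on the tensors; this produces the HD5 weights
# and JSON architecture files listed above.
gatk CNNVariantTrain \
    --input-tensor-dir tensors/ \
    --tensor-type reference \
    --output-dir model/
```

Training a 2D model would additionally require a BAM when writing tensors and `--tensor-type read_tensor` in both steps.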
This optional evaluation and plotting workflow runs a filtering model against truth data (e.g. NIST Genome in a Bottle, Synthetic Diploid truth set) and plots the accuracy.
Requirements/expectations:
- File of VCF files
- Truth VCF and its index
- Truth confidence interval BED

Outputs:
- Evaluation summary
- Plots
- GATK 4.1.4.0
- samtools 1.3.1
- Cromwell version support
- Successfully tested on v47
- Does not work on versions < v23 due to output syntax
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- For help running workflows on the Google Cloud Platform or locally, please view the following tutorial: (How to) Execute Workflows from the gatk-workflows Git Organization.
- Please visit the User Guide site for further documentation on our workflows and tools.
- The following material is provided by the Data Sciences Platform group at the Broad Institute. Please direct any questions or concerns to one of our forum sites: GATK or Terra.
This script is released under the WDL source code license (BSD-3) (see LICENSE in https://github.com/broadinstitute/wdl). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.