Author: William Rowell [email protected]
In this case study we describe applying DeepVariant to PacBio HiFi reads to call variants. We will call small variants from a publicly available whole genome HiFi dataset from PacBio.
This case study involves a two-step process of variant calling. After the first round of calling, SNVs are phased and used to haplotag the input BAM. For the highest accuracy, variants are called again in a second pass. If somewhat lower Indel accuracy is acceptable in exchange for shorter run-time, the first-pass calls be used. Accuracy benchmarks for each pass are shown for this case study.
Singularity will be used to run DeepVariant and hap.py, and we'll use miniconda and a conda environment to handle the other dependencies for the case study, samtools and whatshap.
- singularity (must be installed by
root
user; outside of the scope of this case study) - samtools
- whatshap
# add channels to conda configuration
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create the environment and install dependencies
conda create -y -n deepvariant_whatshap
conda activate deepvariant_whatshap
conda install -y whatshap==1.0 samtools==1.10
We will be using GRCh38 for this case study.
mkdir -p reference
# download and decompress
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
# index reference
samtools faidx reference/GRCh38_no_alt_analysis_set.fasta
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG003.
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
We'll use HG003 chr20 HiFi reads publicly available from the PrecisionFDA Truth v2 Challenge.
mkdir -p input
HTTPDIR=https://downloads.pacbcloud.com/public/dataset/HG003/deepvariant-case-study
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam > input/HG003.GRCh38.chr20.pFDA_truthv2.bam
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai > input/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai
ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150
BIN_VERSION="1.2.0"
mkdir -p deepvariant1
singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref reference/GRCh38_no_alt_analysis_set.fasta \
--reads input/HG003.GRCh38.chr20.pFDA_truthv2.bam \
--output_vcf deepvariant1/output.vcf.gz \
--num_shards $(nproc) \
--regions chr20
mkdir -p whatshap
whatshap phase \
--output whatshap/deepvariant1.phased.vcf.gz \
--reference reference/GRCh38_no_alt_analysis_set.fasta \
--chromosome chr20 \
deepvariant1/output.vcf.gz \
input/HG003.GRCh38.chr20.pFDA_truthv2.bam
tabix -p vcf whatshap/deepvariant1.phased.vcf.gz
whatshap haplotag \
--output whatshap/HG003.GRCh38.chr20.haplotagged.bam \
--reference reference/GRCh38_no_alt_analysis_set.fasta \
whatshap/deepvariant1.phased.vcf.gz \
input/HG003.GRCh38.chr20.pFDA_truthv2.bam
samtools index whatshap/HG003.GRCh38.chr20.haplotagged.bam
ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150
BIN_VERSION="1.2.0"
mkdir -p deepvariant2
singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref reference/GRCh38_no_alt_analysis_set.fasta \
--reads whatshap/HG003.GRCh38.chr20.haplotagged.bam \
--use_hp_information \
--output_vcf deepvariant2/output.vcf.gz \
--num_shards $(nproc) \
--regions chr20
In order to allow DeepVariant to take advantage of the haplotype (HP) tags in
haplotagged BAMs, three flags must be passed to
/opt/deepvariant/bin/make_examples
:
--sort_by_haplotypes --parse_sam_aux_fields --add_hp_channel
For the convenient one-step command /opt/deepvariant/bin/run_deepvariant
, the
--use_hp_information
flag will provide these three flags to make_examples.
mkdir -p happy
singularity exec docker://jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
--threads $(nproc) \
-r reference/GRCh38_no_alt_analysis_set.fasta \
-f benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o happy/giab-comparison.v4.2.first_pass \
--engine=vcfeval \
--pass-only \
-l chr20 \
benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
deepvariant1/output.vcf.gz
First pass output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10474 154 21919 134 10874 81 39 0.985510 0.987868 0.496099 0.986687 NaN NaN 1.748961 2.411423
INDEL PASS 10628 10474 154 21919 134 10874 81 39 0.985510 0.987868 0.496099 0.986687 NaN NaN 1.748961 2.411423
SNP ALL 70166 70143 23 95142 44 24897 7 8 0.999672 0.999374 0.261683 0.999523 2.296566 1.963876 1.883951 2.048990
SNP PASS 70166 70143 23 95142 44 24897 7 8 0.999672 0.999374 0.261683 0.999523 2.296566 1.963876 1.883951 2.048990
mkdir -p happy
singularity exec docker://jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
--threads $(nproc) \
-r reference/GRCh38_no_alt_analysis_set.fasta \
-f benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o happy/giab-comparison.v4.2.second_pass \
--engine=vcfeval \
--pass-only \
-l chr20 \
benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
deepvariant2/output.vcf.gz
Second pass output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10562 66 22343 68 11272 37 30 0.993790 0.993858 0.504498 0.993824 NaN NaN 1.748961 2.475310
INDEL PASS 10628 10562 66 22343 68 11272 37 30 0.993790 0.993858 0.504498 0.993824 NaN NaN 1.748961 2.475310
SNP ALL 70166 70141 25 95598 17 25377 6 4 0.999644 0.999758 0.265455 0.999701 2.296566 1.952947 1.883951 2.057639
SNP PASS 70166 70141 25 95598 17 25377 6 4 0.999644 0.999758 0.265455 0.999701 2.296566 1.952947 1.883951 2.057639