Here we walk through an example of imputing a few chunks, then ligating the results together, so that we have both a VCF with dosage information and consistent phasing information.
Here we download some example data to use. Note that the code below is just a fancy way of ensuring it should download on either mac or linux, otherwise throw an error.
path=http://www.stats.ox.ac.uk/~rdavies/QUILT_example_2021_01_15B.tgz
file=`basename ${path}`
if [ ! -f ${file} ]
then
wget ${path} || true
curl -O ${path} || true
tar -xzvf ${file}
fi
if [ ! -f ${file} ]
then
exit 1
fi
Here we impute four chunks with buffers as usual, but include an overlap region. The overlap region is used in the ligation that follows.
for i in $(seq 0 3)
do
regionStart=`echo $((2000001 + $i * 500000 - 100000))`
regionEnd=`echo $((2500000 + $i * 500000 + 100000))`
./QUILT.R \
--outputdir=quilt_output \
--chr=chr20 \
--regionStart=${regionStart} \
--regionEnd=${regionEnd} \
--buffer=200000 \
--bamlist=package_2021_01_15B/bamlist.1.0.txt \
--posfile=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.posfile.txt \
--phasefile=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.phasefile.txt \
--reference_haplotype_file=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.hap.gz \
--reference_legend_file=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.legend.gz \
--genetic_map_file=package_2021_01_15B/CEU-chr20-final.b38.txt.gz \
--nGen=100 \
--nCores=3 ## optional
done
Here we ligate the results together using bcftools
(see the bcftools site for installation). This ligation process uses the overlapping region to determine if the phased haplotypes are given in the correct orientation or not between two overlapping files, and flips the orientation in the second one if necessary.
Note that this does not change the HD entry (haplotype dosages) which thereafter is not useable. The GP (genotype posterior) and DS (dosage) entries remain useable as they are phase invariant.
bcftools concat \
--ligate \
--output-type z \
--output quilt.chr20.2000001.4000000.concat.vcf.gz \
quilt_output/quilt.chr20.1900001.2600000.vcf.gz \
quilt_output/quilt.chr20.2400001.3100000.vcf.gz \
quilt_output/quilt.chr20.2900001.3600000.vcf.gz \
quilt_output/quilt.chr20.3400001.4100000.vcf.gz
Note that if desired, one can make VCFs with only phase information after imputation using code like the following. These can be ligated similarly to the above.
bcftools view -h quilt_output/quilt.chr20.1900001.2600000.vcf.gz > quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO\tGT\t[%GT\t]\n' quilt_output/quilt.chr20.1900001.2600000.vcf.gz >> quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
bgzip quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
tabix quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf.gz