Skip to content

Latest commit

 

History

History
72 lines (62 loc) · 3.26 KB

ligation.Md

File metadata and controls

72 lines (62 loc) · 3.26 KB

Ligation example

Here we walk through an example of imputing a few chunks, then ligating the results together, so that we have both a VCF with dosage information and consistent phasing information.

Downloading example data

Here we download some example data to use. Note that the code below is just a fancy way of ensuring it should download on either mac or linux, otherwise throw an error.

path=http://www.stats.ox.ac.uk/~rdavies/QUILT_example_2021_01_15B.tgz
file=`basename ${path}`
if [ ! -f ${file} ]
then
    wget ${path} || true
    curl -O ${path} || true
    tar -xzvf ${file}
fi
if [ ! -f ${file} ]
then
    exit 1
fi

Impute chunks

Here we impute four chunks with buffers as usual, but include an overlap region. The overlap region is used in the ligation that follows.


for i in $(seq 0 3)
do
    regionStart=`echo $((2000001 + $i * 500000 - 100000))`
    regionEnd=`echo $((2500000 + $i * 500000 + 100000))`  
    ./QUILT.R \
        --outputdir=quilt_output \
        --chr=chr20 \
        --regionStart=${regionStart} \
        --regionEnd=${regionEnd} \
        --buffer=200000 \
        --bamlist=package_2021_01_15B/bamlist.1.0.txt \
        --posfile=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.posfile.txt \
        --phasefile=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.phasefile.txt \
         --reference_haplotype_file=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.hap.gz \
        --reference_legend_file=package_2021_01_15B/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.legend.gz \
        --genetic_map_file=package_2021_01_15B/CEU-chr20-final.b38.txt.gz \
        --nGen=100 \
	--nCores=3 ## optional
done

Ligate results

Here we ligate the results together using bcftools (see the bcftools site for installation). This ligation process uses the overlapping region to determine if the phased haplotypes are given in the correct orientation or not between two overlapping files, and flips the orientation in the second one if necessary. Note that this does not change the HD entry (haplotype dosages) which thereafter is not useable. The GP (genotype posterior) and DS (dosage) entries remain useable as they are phase invariant.

bcftools concat \
    --ligate \
    --output-type z \
    --output quilt.chr20.2000001.4000000.concat.vcf.gz \
    quilt_output/quilt.chr20.1900001.2600000.vcf.gz \
    quilt_output/quilt.chr20.2400001.3100000.vcf.gz \
    quilt_output/quilt.chr20.2900001.3600000.vcf.gz \
    quilt_output/quilt.chr20.3400001.4100000.vcf.gz

Note that if desired, one can make VCFs with only phase information after imputation using code like the following. These can be ligated similarly to the above.

bcftools view -h quilt_output/quilt.chr20.1900001.2600000.vcf.gz > quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO\tGT\t[%GT\t]\n' quilt_output/quilt.chr20.1900001.2600000.vcf.gz >> quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
bgzip quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf
tabix quilt_output/quilt.chr20.1900001.2600000.onlygt.vcf.gz