This is the detailed guideline of QUILT2. Also, check out the nifty lcWGS-imputation-workflow for automate and reproducible workflow.
- Split the genome into chunks
- Prepare the reference panel seperately
- Perform diploid imputation
- Perform NIPT imputation
- NIPT GWAS
- Evaluation and visualization
For many reasons, imputation is usually done for each genomic regions/chunks
independently with some overlapped buffer in between. For example, for
computational reason, we want to distribute our tasks into multiple machines
with limited memory. However, for the accuracy reason, we need to impute with
reasonable big chunks so that the model can work well. QUILT2 offers the
quilt_chunk_map
function for splitting the genome into multiple reasonable
chunks given the genetic map file. The function returns a table with the 3rd
column being the genomic region (in bcftools-like format) of each chunk.
dat <- QUILT::quilt_chunk_map("chr20", "package/CEU-chr20-final.b38.txt.gz")
str(dat)
## 'data.frame': 16 obs. of 3 variables:
## $ chunk : num 0 1 2 3 4 5 6 7 8 9 ...
## $ chr : chr "chr20" "chr20" "chr20" "chr20" ...
## $ region: chr "chr20:1-2998495" "chr20:2965894-5963047" ...
For large reference panels, and for many jobs involving imputing few samples,
it can be computationally efficient to pre-process the reference panel and
save the output, and use this output for multiple independent runs, which is
done via the which is done via QUILT2_prepare_reference.R
. Here is an example
for how we would do this, for the case of the quick start example.
./QUILT2_prepare_reference.R \
--genetic_map_file=package/CEU-chr20-final.b38.txt.gz \
--reference_vcf_file=package/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.vcf.gz \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--nGen=100 \
--buffer=500000 \
--outputdir=quilt2_output
Input/Output files:
reference_vcf_panel
. QUILT2 now only supports VCF/BCF file with phased genotypes as the reference panel, thanks to vcfpp!.genetic_map_file
. File with genetic map information, with 3 white-space delimited columns giving position (1-based), genetic rate map in cM/Mbp, and genetic map in cM. Find the prepared ones in the maps folder.outputdir
. Directory for saving the output. The prepared reference data is saved asRData
that is in theoutputdir/RData
.
Find the more options via ./QUILT2_prepare_reference.R -h
.
Now given the saved reference data, we can perform imputation as follows. Note
that QUILT2 has option method=diploid
in default.
./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--method=diploid \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--nGen=100 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.diploid.vcf.gz
Note that when running multiple versions of QUILT2 against the same reference
data, it is useful to set output_filename to change the default filename for
each job, and to keep the temporary directories used independent (which is the
behaviour for default tempdir
).
For a full list of options, query ?QUILT::QUILT
in R, or alternatively, type ./QUILT2.R --help
in terminal
These parameters are most likely to influence run time and accuracy
nGibbsSamples
. Number of Gibbs samples performed. Run time is approximately linear in this, and increasing it will increase accuracy, but with diminishing returns. Generally less important for higher coverage.n_seek_its
. Number of iterations between Gibbs sampling on small reference panel, and imputing using the full reference panel to find good matches for the small reference panel. Run time is approximately linear in this. Setting it higher will increase accuracy but with diminishing returns.Ksubset
. Size of the small reference panel. Higher values might increase accuracy but will increase run time.
- VCF with both SNP annotation information (see below) and per-sample genotype information. Per-sample genotype information includes the following entries
- GT (Phased genotypes) Phased genotype, where each allele is the rounded per-haplotype posterior probability (HD below)
- GP (Genotype posteriors) Posterior probabilities of the three genotypes given the data
- DS (Diploid dosage) Posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
- HD (Haploid dosages) Per-haplotype posterior probability of an alternate allele
If one runs QUILT2 with chunks determined above by quilt_chunk_map
, which
will create overlapping region for successive chunks, then the ligation of
multiple VCF files for diploid results can be done using bcftools
. Note that
this does not change the HD entry (haplotype dosages) which thereafter is
not useable. The GP (genotype posterior) and DS (dosage) entries remain
useable as they are phase invariant.
bcftools concat \
--ligate \
--output-type z \
--output quilt2.diploid.chr20.ligate.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk0.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk1.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk2.vcf.gz
To perform imputation for NIPT data with QUILT2, we need to set method=nipt
and provide a file with estimated fetal fraction for each sample.
./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--fflist=package/fflist.1.0.txt \
--method=nipt \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--nGen=100 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.nipt.vcf.gz
Since we are imputing both maternal and fetal genotypes, the outputted VCF is totally different from the diploid output!
First of all, the VCF header gives explanation on the FORMAT fields.
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes in order of maternal transmitted, maternal untransmitted, and fetal transmitted">
##FORMAT=<ID=MGP,Number=3,Type=Float,Description="Maternal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=MDS,Number=1,Type=Float,Description="Maternal Diploid dosage">
##FORMAT=<ID=FGP,Number=3,Type=Float,Description="Fetal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=FDS,Number=1,Type=Float,Description="Fetal Diploid dosage">
The per-sample genotype information includes the following entries:
- GT (Phased Genotypes). Phased genotype for three haplotypes in order of the maternal transmitted, maternal untransmitted, fetal transmitted.
- MGP (Maternal genotype posteriors). Maternal posterior probabilities of the three genotypes given the data and fetal fraction.
- MDS (Maternal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
- FGP (Fetal genotype posteriors). Fetal posterior probabilities of the three genotypes given the data and fetal fraction.
- FDS (Fetal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
Here we have two genomes that forms four haplotypes, while the QUILT2-nipt
only outputs 3 haplotypes in GT
. If one does not care about the phased
genotypes of both mother and fetus, then we can just concatenate the chunks
and remove one of the duplicated variants from the overlapping region, which
can be done as follows.
bcftools concat \
quilt2_output/quilt2.nipt.chr20.chunk0.vcf.gz \
quilt2_output/quilt2.nipt.chr20.chunk1.vcf.gz \
quilt2_output/quilt2.nipt.chr20.chunk2.vcf.gz | \
bcftools norm -D
--output-type z \
--output quilt2.nipt.chr20.concat.vcf.gz
Note: the above will result in meaningless phased GT across chunks. If we want to have proper ligated phased genotypes across the chunks for both mother and fetus, we need to create two separate VCF for mother and fetus with phased genotypes in each chunk using the vcf_nipt utility.
## outputs two VCFs: chunk0.mat.vcf.gz and chunk0.fet.vcf.gz
vcf_nipt -i quilt2_output/quilt2.nipt.chr20.chunk0.vcf.gz -o chunk0
Then we can perform ligation normally as above for the diploid genome.
QUILT2 outputs both maternal and fetal dosage in a single VCF, which may be not the standard that one used for GWAS. Here is a simple guide on how to perform GWAS on NIPT imputation results.
- extract maternal genotype dosages
bcftools query -f '%CHROM:%POS, %REF, %ALT[, %MDS]\n' $niptvcf \
| gzip -c > mat.bimbam.gz
- extract fetal genotype dosages
bcftools query -f '%CHROM:%POS, %REF, %ALT[, %FDS]\n' $niptvcf \
| gzip -c > fet.bimbam.gz
- perform lmm model
gemma -lmm 3 \
-notsnp \ ## not SNP filtering
-g $genfile \ ## mat.bimbam.gz or fet.bibam.gz
-p $phenofile \ ## phenotype file
-c $covfile \ ## covariant file
-k $grmfile \ ## GRM file generated via gemma -gk 1
-outdir $outdir \ ## output dir
-n $i ## which column in the phenotype file to be analysed
-o output$i ## output prefix
- extract maternal genotype dosages
plink2 --vcf $niptvcf dosage=MDS --make-pgen --out mat.plink2
- extract fetal genotype dosages
plink2 --vcf $niptvcf dosage=FDS --make-pgen --out fet.plink2
- perform glm model
plink2 --pfile fet.plink2 \ ## fetal or maternal dosages in pgen format
--pheno $phenofile \ ## phenotype file
--pheno-name $pheno \ ## which phenotype to be analysed
--covar $covfile \ ## covariant file
--covar-name age sex PC1-PC10 \ ## which covariants used
--glm hide-covar \ ## glm model and disable output covar
--out $output
Assume we have the truth genotypes data, we then can evaluate and visualize the imputation results conveniently via vcfppR.
library(vcfppR)
res <- vcfcomp(test = imputedvcf, truth = truthvcf,
stats = "r2", region = "chr20",
formats = c("DS","GT"))
par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")