Skip to content

Latest commit

 

History

History
270 lines (216 loc) · 11.7 KB

README_QUILT2.org

File metadata and controls

270 lines (216 loc) · 11.7 KB

Tutorial of QUILT2

This is the detailed guideline of QUILT2. Also, check out the nifty lcWGS-imputation-workflow for automate and reproducible workflow.

Table of Contents

Split the genome into chunks

For many reasons, imputation is usually done for each genomic regions/chunks independently with some overlapped buffer in between. For example, for computational reason, we want to distribute our tasks into multiple machines with limited memory. However, for the accuracy reason, we need to impute with reasonable big chunks so that the model can work well. QUILT2 offers the quilt_chunk_map function for splitting the genome into multiple reasonable chunks given the genetic map file. The function returns a table with the 3rd column being the genomic region (in bcftools-like format) of each chunk.

dat <- QUILT::quilt_chunk_map("chr20", "package/CEU-chr20-final.b38.txt.gz")
str(dat)
## 'data.frame':	16 obs. of  3 variables:
##  $ chunk : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ chr   : chr  "chr20" "chr20" "chr20" "chr20" ...
##  $ region: chr  "chr20:1-2998495" "chr20:2965894-5963047" ...

Prepare the reference panel seperately

For large reference panels, and for many jobs involving imputing few samples, it can be computationally efficient to pre-process the reference panel and save the output, and use this output for multiple independent runs, which is done via the which is done via QUILT2_prepare_reference.R. Here is an example for how we would do this, for the case of the quick start example.

./QUILT2_prepare_reference.R \
    --genetic_map_file=package/CEU-chr20-final.b38.txt.gz \
    --reference_vcf_file=package/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.vcf.gz \
    --chr=chr20 \
    --regionStart=2000001 \
    --regionEnd=4000000 \
    --nGen=100 \
    --buffer=500000 \
    --outputdir=quilt2_output

Input/Output files:

  • reference_vcf_panel. QUILT2 now only supports VCF/BCF file with phased genotypes as the reference panel, thanks to vcfpp!.
  • genetic_map_file. File with genetic map information, with 3 white-space delimited columns giving position (1-based), genetic rate map in cM/Mbp, and genetic map in cM. Find the prepared ones in the maps folder.
  • outputdir. Directory for saving the output. The prepared reference data is saved as RData that is in the outputdir/RData.

Find the more options via ./QUILT2_prepare_reference.R -h.

Perform diploid imputation

Now given the saved reference data, we can perform imputation as follows. Note that QUILT2 has option method=diploid in default.

./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--method=diploid \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--nGen=100 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.diploid.vcf.gz

Note that when running multiple versions of QUILT2 against the same reference data, it is useful to set output_filename to change the default filename for each job, and to keep the temporary directories used independent (which is the behaviour for default tempdir).

For a full list of options, query ?QUILT::QUILT in R, or alternatively, type ./QUILT2.R --help in terminal

Important parameters that influence run time and accuracy

These parameters are most likely to influence run time and accuracy

  • nGibbsSamples. Number of Gibbs samples performed. Run time is approximately linear in this, and increasing it will increase accuracy, but with diminishing returns. Generally less important for higher coverage.
  • n_seek_its. Number of iterations between Gibbs sampling on small reference panel, and imputing using the full reference panel to find good matches for the small reference panel. Run time is approximately linear in this. Setting it higher will increase accuracy but with diminishing returns.
  • Ksubset. Size of the small reference panel. Higher values might increase accuracy but will increase run time.

Output of diploid imputation

  • VCF with both SNP annotation information (see below) and per-sample genotype information. Per-sample genotype information includes the following entries
  • GT (Phased genotypes) Phased genotype, where each allele is the rounded per-haplotype posterior probability (HD below)
  • GP (Genotype posteriors) Posterior probabilities of the three genotypes given the data
  • DS (Diploid dosage) Posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
  • HD (Haploid dosages) Per-haplotype posterior probability of an alternate allele

Ligate diploid genome

If one runs QUILT2 with chunks determined above by quilt_chunk_map, which will create overlapping region for successive chunks, then the ligation of multiple VCF files for diploid results can be done using bcftools. Note that this does not change the HD entry (haplotype dosages) which thereafter is not useable. The GP (genotype posterior) and DS (dosage) entries remain useable as they are phase invariant.

bcftools concat \
    --ligate \
    --output-type z \
    --output quilt2.diploid.chr20.ligate.vcf.gz \
    quilt2_output/quilt2.diploid.chr20.chunk0.vcf.gz \
    quilt2_output/quilt2.diploid.chr20.chunk1.vcf.gz \
    quilt2_output/quilt2.diploid.chr20.chunk2.vcf.gz 

Perform NIPT imputation

To perform imputation for NIPT data with QUILT2, we need to set method=nipt and provide a file with estimated fetal fraction for each sample.

./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--fflist=package/fflist.1.0.txt \
--method=nipt \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--nGen=100 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.nipt.vcf.gz

Output of NIPT imputation

Since we are imputing both maternal and fetal genotypes, the outputted VCF is totally different from the diploid output!

First of all, the VCF header gives explanation on the FORMAT fields.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes in order of maternal transmitted, maternal untransmitted, and fetal transmitted">
##FORMAT=<ID=MGP,Number=3,Type=Float,Description="Maternal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=MDS,Number=1,Type=Float,Description="Maternal Diploid dosage">
##FORMAT=<ID=FGP,Number=3,Type=Float,Description="Fetal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=FDS,Number=1,Type=Float,Description="Fetal Diploid dosage">

The per-sample genotype information includes the following entries:

  • GT (Phased Genotypes). Phased genotype for three haplotypes in order of the maternal transmitted, maternal untransmitted, fetal transmitted.
  • MGP (Maternal genotype posteriors). Maternal posterior probabilities of the three genotypes given the data and fetal fraction.
  • MDS (Maternal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
  • FGP (Fetal genotype posteriors). Fetal posterior probabilities of the three genotypes given the data and fetal fraction.
  • FDS (Fetal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele

Ligate NIPT results

Here we have two genomes that forms four haplotypes, while the QUILT2-nipt only outputs 3 haplotypes in GT. If one does not care about the phased genotypes of both mother and fetus, then we can just concatenate the chunks and remove one of the duplicated variants from the overlapping region, which can be done as follows.

bcftools concat \
    quilt2_output/quilt2.nipt.chr20.chunk0.vcf.gz \
    quilt2_output/quilt2.nipt.chr20.chunk1.vcf.gz \
    quilt2_output/quilt2.nipt.chr20.chunk2.vcf.gz | \
    bcftools norm -D 
    --output-type z \
    --output quilt2.nipt.chr20.concat.vcf.gz

Note: the above will result in meaningless phased GT across chunks. If we want to have proper ligated phased genotypes across the chunks for both mother and fetus, we need to create two separate VCF for mother and fetus with phased genotypes in each chunk using the vcf_nipt utility.

## outputs two VCFs: chunk0.mat.vcf.gz and chunk0.fet.vcf.gz
vcf_nipt -i  quilt2_output/quilt2.nipt.chr20.chunk0.vcf.gz -o chunk0

Then we can perform ligation normally as above for the diploid genome.

NIPT GWAS

QUILT2 outputs both maternal and fetal dosage in a single VCF, which may be not the standard that one used for GWAS. Here is a simple guide on how to perform GWAS on NIPT imputation results.

GEMMA BIMBAM format

  • extract maternal genotype dosages
bcftools query -f '%CHROM:%POS, %REF, %ALT[, %MDS]\n' $niptvcf \
         | gzip -c > mat.bimbam.gz
  • extract fetal genotype dosages
bcftools query -f '%CHROM:%POS, %REF, %ALT[, %FDS]\n' $niptvcf \
         | gzip -c > fet.bimbam.gz
  • perform lmm model
gemma -lmm 3 \
      -notsnp \  ## not SNP filtering
      -g $genfile \  ## mat.bimbam.gz or fet.bibam.gz
      -p $phenofile \  ## phenotype file 
      -c $covfile \   ## covariant file
      -k $grmfile \   ## GRM file generated via gemma -gk 1
      -outdir $outdir \  ## output dir 
      -n $i   ## which column in the phenotype file to be analysed
      -o output$i  ## output prefix  

PLINK2 format

  • extract maternal genotype dosages
plink2 --vcf $niptvcf dosage=MDS --make-pgen --out mat.plink2
  • extract fetal genotype dosages
plink2 --vcf $niptvcf dosage=FDS --make-pgen --out fet.plink2
  • perform glm model
plink2 --pfile fet.plink2 \  ## fetal or maternal dosages in pgen format
       --pheno $phenofile \  ## phenotype file
       --pheno-name $pheno \  ## which phenotype to be analysed
       --covar $covfile \    ## covariant file
       --covar-name age sex PC1-PC10 \  ## which covariants used
       --glm hide-covar \   ## glm model and disable output covar
       --out $output 

Evaluation and visualization

Assume we have the truth genotypes data, we then can evaluate and visualize the imputation results conveniently via vcfppR.

library(vcfppR)
res <- vcfcomp(test = imputedvcf, truth = truthvcf,
               stats = "r2", region = "chr20", 
               formats = c("DS","GT"))
par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")