Snakemake pipeline for processing GVCF files to generate a multi-sample VCF of exon variants. The workflow is built for GVCFs aligned to hg19 and outputs variants lifted over to hg38.
The current workflow follows the GATK tutorial for joint calling on consolidated GVCFs:
- Merge GVCFs using GenomicsDBImport for specified intervals.
- Joint calling on merged GenomicsDB using GenotypeGVCFs.
- Extract variants (remove indels) using vcftools.
- Filter for exon variants using bcftools and a BED file.
- Liftover to hg38 using CrossMap.
- Filter the VCF by region and remove unplaced KI27* contigs.
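Conceptually, the Snakemake rules wrap standard command-line calls along these lines. The sketch below uses placeholder file, workspace, chain, and reference names and is not the pipeline's actual rule definitions:

```sh
# 1. Consolidate per-sample GVCFs into a GenomicsDB workspace for an interval
gatk GenomicsDBImport \
    -V INCOV001.gvcf.gz -V INCOV002.gvcf.gz \
    --genomicsdb-workspace-path genomicsdb_chr1 -L chr1

# 2. Joint genotyping on the consolidated workspace
gatk GenotypeGVCFs -R hg19.fa -V gendb://genomicsdb_chr1 -O joint.vcf.gz

# 3. Remove indels, keeping SNPs and their INFO fields
vcftools --gzvcf joint.vcf.gz --remove-indels --recode --recode-INFO-all \
    --stdout | bgzip > snps.vcf.gz
tabix -p vcf snps.vcf.gz

# 4. Keep only variants overlapping exon intervals from a BED file
bcftools view -R exons.bed snps.vcf.gz -O z -o exons.hg19.vcf.gz

# 5. Lift over coordinates from hg19 to hg38
CrossMap.py vcf hg19ToHg38.over.chain.gz exons.hg19.vcf.gz hg38.fa exons.hg38.vcf

# 6. Restrict to the configured chromosomes (drops unplaced KI27* contigs);
#    chromosome list abbreviated here
bcftools view -t chr1,chr2,chrX,chrY exons.hg38.vcf -O z -o final.vcf.gz
```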
The previous workflow extracted variants from each GVCF, lifted them over to hg38, filtered by region and quality, and finally merged them into one VCF (old workflow in variants.smk). It is deprecated because filtering each sample independently caused loss of variants, and assuming a reference genotype for missing variants may have introduced artificial genotypes during the merging step.
Dependencies are managed by Snakemake via conda environments. See envs/crossmap.yaml for the full list of dependencies installed via conda. Missing Linux libraries (ex. libcurl4-openssl-dev) may cause errors during installation of CrossMap; install them with sudo apt-get install before running the Snakemake pipeline.
GATK4 installation is required.
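If GATK4 is not already on your PATH, one option (an assumption, not the only route) is installing it from bioconda:

```sh
conda install -c conda-forge -c bioconda gatk4
```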
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository.
Clone this repository:
git clone https://github.com/racng/snakemake-merge-wgs.git
Prepare a sample sheet CSV file with the individual sample IDs and their pooled sample name. Column names are flexible but must be consistent with your configuration file.
Here is an example sample sheet, where the PBMC sample origin column contains the names of individual samples, and the sample name column contains the pooled sample label to group by. Note that multiple samples can come from the same patient, so an additional string-parsing argument is provided for extracting the patient ID (see regex below).
| PBMC sample origin | sample name |
| --- | --- |
| INCOV001-1 | B1 |
| INCOV002-1 | B1 |
| INCOV003-2 | B2 |
| INCOV004-1 | B2 |
| 1234BW | B2 |
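Saved as a CSV file (the path supplied as samples in the config), the sheet above would contain:

```csv
PBMC sample origin,sample name
INCOV001-1,B1
INCOV002-1,B1
INCOV003-2,B2
INCOV004-1,B2
1234BW,B2
```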
Configure the workflow according to your needs by editing config.yaml:
- datadir: Directory containing GVCFs formatted as "{patient_id}.gvcf.gz".
- refdir: Directory containing references.
- outdir: Output directory.
- samples: Path to the sample sheet CSV file.
- pooled_sample_col: Column name for pooled sample labels (ex. sample name).
- individual_sample_col: Column name for individual sample IDs (ex. PBMC sample origin).
- include: Optional list of pooled samples to process. Put an empty list [] to process all pooled samples.
- regex: Regex pattern for parsing the patient ID from the individual_sample_col in the sample sheet. ex. "(INCOV\\d{3}|\\d{4}BW)" would parse out INCOV001, INCOV002, INCOV003, INCOV004, and 1234BW from the example sample sheet. If no string parsing is necessary, put "(.+)". Note that backslashes are special characters in YAML files.
- regions: Regions/intervals to include when running GenomicsDBImport and bcftools filtering, ex. "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY".
- threads: Number of parallel threads allowed (used by GenomicsDBImport).
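As an illustration, a config.yaml matching the example sample sheet might look like the following (directory paths and the thread count are placeholders):

```yaml
datadir: /path/to/gvcfs            # contains {patient_id}.gvcf.gz files
refdir: /path/to/references
outdir: /path/to/output
samples: samplesheet.csv
pooled_sample_col: "sample name"
individual_sample_col: "PBMC sample origin"
include: []                        # empty list = process all pooled samples
regex: "(INCOV\\d{3}|\\d{4}BW)"    # note the doubled backslashes for YAML
regions: "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"
threads: 8
```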
Install Snakemake using conda and mamba:
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry run via:
snakemake --use-conda -n
Execute the workflow locally using $N cores (or all cores):
snakemake --use-conda --cores $N
Use the --configfile option to select a different config file. To download packages to a dedicated folder shared across multiple runs, use the --conda-prefix option.
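For example, combining both options (the config file name and prefix directory here are placeholders):

```sh
snakemake --use-conda --cores $N \
    --configfile configs/run2.yaml \
    --conda-prefix ~/snakemake-conda-envs
```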
See the Snakemake documentation for further details.
To run Snakemake in the background and redirect the terminal output to a file, wrap the above commands with nohup:
nohup [INSERT SNAKEMAKE COMMAND] > log.txt & echo $! > pid.txt
The process ID is saved in pid.txt.
Check the background job and its job ID with jobs. Bring it back to the foreground with fg [jobid].
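Using the file names from the command above, you can then follow the log or stop the run:

```sh
tail -f log.txt          # follow the pipeline output
kill "$(cat pid.txt)"    # stop the background run if needed
```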
- Rachel Ng (@racng) - Heath Lab, Institute for Systems Biology
- Chris Lausted - Hood Lab, Institute for Systems Biology