Note: This pipeline was designed to run on the Triton Shared Compute Cluster (TSCC) at the University of California - San Diego. Running it elsewhere will require modifying the source code.
The EnsembleVariantCallingPipeline takes files in FASTQ or BAM format and performs SNV and INDEL variant calling from 4 variant callers (Mutect2, Strelka2, Varscan2, MuSE) and indel variant calling from 3 variant callers (Mutect2, Strelka2, Varscan2).
- Clone the directory
git clone https://github.com/AlexandrovLab/EnsembleVariantCallingPipeline.git
cd EnsembleVariantCallingPipeline/
- Run the setup_env.sh script.
bash setup_env.sh
DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.
- Generate project directory structure. (takes <2 minutes)
Run
run_evc
to create thejobs
directory in the output directory. The pipeline will generate.pbs
files for you to manually submit at each stage.run_evc \ path/to/fastq/files \ output/directory \ path/to/sample.map \ email.for@notification \ reference_genome (fasta) \ pon (INTERNAL_PON or provide a PON for Mutect) \ gnomad_dbSNP \ known_indel_list \ base_recalibration_list \ max_walltime (hours only) \ queue (hotel or home) \ interval_list_for_mutect \ Optional: run refine? (yes or no) [default: no]
An exemplary suggested inputs:
run_evc \ path/to/fastq/files \ output/directory \ path/to/sample.map \ email.for@notification \ /restricted/alexandrov-group/shared/Reference_Genomes/GRCh38.d1.vd1/GRCh38.d1.vd1.fa \ /projects/ps-lalexandrov/shared/Reference_Genomes/TCGA_GDC_files/tumor_normal_PONs/MuTect2.PON.5210.vcf.gz \ /restricted/alexandrov-group/shared/Reference_Genomes/dbSNP/af-only-gnomad.hg38_no_alt.vcf.gz \ /restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/IndelRealignemnt_files.txt \ /restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/BaseRecalibration_files.txt \ 500 \ home-alexandrov \ interval_list_for_mutect \ no
- Alignment (takes 10 minutes to several hours depending on job queue traffic and the size of your fastq files)
Performs alignment with bwa-mem and marks duplicates caused by PCR with Picard's (GATK's) MarkDuplicates.
- Navigate to the
jobs/align
directory in the output directory specified earlier. This directory contains the.pbs
scripts used to run alignment on your fastq samples on TSCC. - Submit all alignment jobs to the job queue.
for pbs_file in *.pbs; do qsub ${pbs_file}; done
This script will generate directories and perform alignment for each of your samples directly under
output_dir
. Within each of these, the script will generate tumor and normal_raw.bam
files and then tumor and normal_mkdp.bam
files. - Check that your aligned `.bam` files were generated properly. A quick method of checking the integrity of every bam file quickly is by running `samtools quickcheck` on the `.bam` files (*Note: empty output means there were no errors*). Here is a script to quickly check all of your files:
output_dir=[your output directory] for f in ${output_dir}/*/*_mkdp.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
- Navigate to the
- (Optional) Refine alignment (takes several hours)
You might want to skip this step if you believe your fastq files are of high quality since this step takes a long time to run. If you originally ran
run_evc
with refinement disabled, you can still run refinement by rerunningrun_evc
with the refinement option enabled.- Navigate to the
jobs/refine
directory. - Submit all
_targetInterval.pbs
files to job queue.for pbs_file in *_targetInterval.pbs; do qsub ${pbs_file}; done
This script will subsequently submit the
_Trefine.pbs
and_Nrefine.pbs
files. - Once finished, you can check the refined bam files, which are named
_final.bam
, using the same program as after alignmentsamtools quickcheck
.output_dir=[your output directory] for f in ${output_dir}/*/*_final.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
- Navigate to the
- Generating the panel of normals (takes about 1 hour) - WIP. Does not work yet. Please pass in your own panel of normals or use an empty file as a placeholder.
The panel of normals (PON) is useful in filtering out sequencing errors caused by limitations specific to the sequencing machine used. By comparing all normal samples in a batch that used the same sequening machine, this can identify recurrent mutations caused by technical artifacts from the machine. It can either be passed in through
run_evc
or generated from the bam files. The file is required to run Mutect2.- Navigate to the
jobs/pon
directory. - Submit all
_pon.pbs
jobs to the job queue.for pbs_file in *_pon.pbs; do qsub ${pbs_file}; done
- Combine all individual pons into a merged pon
TODO: add code
- Navigate to the
- Run variant calling (takes several hours)
The order of running the variant callers does not matter since they work independently. Filtering the
.vcf
files after variant calling is highly recommended.- Mutect2
Submit the
_mutect.pbs
files underjobs/mutect
.cd jobs/mutect for pbs_file in *_mutect.pbs; do qsub ${pbs_file}; done
- Strelka2
Submit the
_strelka.pbs
files underjobs/strelka
.cd jobs/strelka for pbs_file in *_strelka.pbs; do qsub ${pbs_file}; done
- Varscan2
Submit the
_varscan.pbs
files underjobs/varscan
.cd jobs/varscan for pbs_file in *_varscan.pbs; do qsub ${pbs_file}; done
- MuSE
Submit the
_muse.pbs
files underjobs/muse
.cd jobs/muse for pbs_file in *_muse.pbs; do qsub ${pbs_file}; done
- Mutect2
DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.
- Generate project directory structure. (takes <2 minutes)
Run
run_evc_bam
to create thejobs
directory in the output directory. The pipeline will generate.pbs
files for you to manually submit at each stage.run_evc_bam \ path/to/bam/files \ output/directory \ path/to/sample.map \ email.for@notification \ reference_genome (fasta) \ pon (INTERNAL_PON) \ gnomad_dbSNP \ max_walltime (hours only) \ queue (hotel or home) \ interval_list_for_mutect
- Follow the "Running from fastq" pipeline starting at step 5: Run variant calling
One shall submit the jobs at each stage step by step: (DO NOT move to the next step until the previous step is finished and verified.)