EnsembleVariantCallingPipeline (v 1.0)

Note: This pipeline was designed to run on the Triton Shared Compute Cluster (TSCC) at the University of California - San Diego. Running it elsewhere will require modifying the source code.

EnsembleVariantCallingPipeline (v 1.0)

The EnsembleVariantCallingPipeline takes files in FASTQ or BAM format and performs SNV and INDEL variant calling from 4 variant callers (Mutect2, Strelka2, Varscan2, MuSE) and indel variant calling from 3 variant callers (Mutect2, Strelka2, Varscan2).

Pipeline structure

First time setup

Clone the directory

git clone https://github.com/AlexandrovLab/EnsembleVariantCallingPipeline.git
cd EnsembleVariantCallingPipeline/

Run the setup_env.sh script.

bash setup_env.sh

Running the pipeline from fastq

DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.

Generate project directory structure. (takes <2 minutes)

Run run_evc to create the jobs directory in the output directory. The pipeline will generate .pbs files for you to manually submit at each stage.

run_evc \
	path/to/fastq/files \
	output/directory \
	path/to/sample.map \
	email.for@notification \
	reference_genome (fasta) \
	pon (INTERNAL_PON or provide a PON for Mutect) \
	gnomad_dbSNP \
	known_indel_list \
	base_recalibration_list \
	max_walltime (hours only) \
	queue (hotel or home) \
	interval_list_for_mutect \
	Optional: run refine? (yes or no) [default: no]

An exemplary suggested inputs:

run_evc \
	path/to/fastq/files \
	output/directory \
	path/to/sample.map \
	email.for@notification \
	/restricted/alexandrov-group/shared/Reference_Genomes/GRCh38.d1.vd1/GRCh38.d1.vd1.fa \
	/projects/ps-lalexandrov/shared/Reference_Genomes/TCGA_GDC_files/tumor_normal_PONs/MuTect2.PON.5210.vcf.gz \
	/restricted/alexandrov-group/shared/Reference_Genomes/dbSNP/af-only-gnomad.hg38_no_alt.vcf.gz \
	/restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/IndelRealignemnt_files.txt \
	/restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/BaseRecalibration_files.txt \
	500 \
	home-alexandrov \
	interval_list_for_mutect \
	no

Alignment (takes 10 minutes to several hours depending on job queue traffic and the size of your fastq files)
Performs alignment with bwa-mem and marks duplicates caused by PCR with Picard's (GATK's) MarkDuplicates.
1. Navigate to the jobs/align directory in the output directory specified earlier. This directory contains the .pbs scripts used to run alignment on your fastq samples on TSCC.
2. Submit all alignment jobs to the job queue.
```
for pbs_file in *.pbs; do qsub ${pbs_file}; done
```
  This script will generate directories and perform alignment for each of your samples directly under output_dir. Within each of these, the script will generate tumor and normal _raw.bam files and then tumor and normal _mkdp.bam files.
3. Check that your aligned `.bam` files were generated properly. A quick method of checking the integrity of every bam file quickly is by running `samtools quickcheck` on the `.bam` files (*Note: empty output means there were no errors*). Here is a script to quickly check all of your files:
```
output_dir=[your output directory]
for f in ${output_dir}/*/*_mkdp.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
```
(Optional) Refine alignment (takes several hours)
You might want to skip this step if you believe your fastq files are of high quality since this step takes a long time to run. If you originally ran run_evc with refinement disabled, you can still run refinement by rerunning run_evc with the refinement option enabled.
1. Navigate to the jobs/refine directory.
2. Submit all _targetInterval.pbs files to job queue.
```
for pbs_file in *_targetInterval.pbs; do qsub ${pbs_file}; done
```
  This script will subsequently submit the _Trefine.pbs and _Nrefine.pbs files.
3. Once finished, you can check the refined bam files, which are named _final.bam, using the same program as after alignment samtools quickcheck.
```
output_dir=[your output directory]
for f in ${output_dir}/*/*_final.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
```
Generating the panel of normals (takes about 1 hour) - WIP. Does not work yet. Please pass in your own panel of normals or use an empty file as a placeholder.
The panel of normals (PON) is useful in filtering out sequencing errors caused by limitations specific to the sequencing machine used. By comparing all normal samples in a batch that used the same sequening machine, this can identify recurrent mutations caused by technical artifacts from the machine. It can either be passed in through run_evc or generated from the bam files. The file is required to run Mutect2.
1. Navigate to the jobs/pon directory.
2. Submit all _pon.pbs jobs to the job queue.
```
for pbs_file in *_pon.pbs; do qsub ${pbs_file}; done
```
3. Combine all individual pons into a merged pon
```
TODO: add code
```
Run variant calling (takes several hours)
The order of running the variant callers does not matter since they work independently. Filtering the .vcf files after variant calling is highly recommended.
- Mutect2
  Submit the _mutect.pbs files under jobs/mutect.
```
cd jobs/mutect
for pbs_file in *_mutect.pbs; do qsub ${pbs_file}; done
```
- Strelka2
  Submit the _strelka.pbs files under jobs/strelka.
```
cd jobs/strelka
for pbs_file in *_strelka.pbs; do qsub ${pbs_file}; done
```
- Varscan2
  Submit the _varscan.pbs files under jobs/varscan.
```
cd jobs/varscan
for pbs_file in *_varscan.pbs; do qsub ${pbs_file}; done
```
- MuSE
  Submit the _muse.pbs files under jobs/muse.
```
cd jobs/muse
for pbs_file in *_muse.pbs; do qsub ${pbs_file}; done
```

Running the pipeline starting with a bam file

DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.

Generate project directory structure. (takes <2 minutes)

Run run_evc_bam to create the jobs directory in the output directory. The pipeline will generate .pbs files for you to manually submit at each stage.

run_evc_bam \
	path/to/bam/files \
	output/directory \
	path/to/sample.map \
	email.for@notification \
	reference_genome (fasta) \
	pon (INTERNAL_PON) \
	gnomad_dbSNP \
	max_walltime (hours only) \
	queue (hotel or home) \
	interval_list_for_mutect

Follow the "Running from fastq" pipeline starting at step 5: Run variant calling

One shall submit the jobs at each stage step by step: (DO NOT move to the next step until the previous step is finished and verified.)

Name		Name	Last commit message	Last commit date
Latest commit History 506 Commits
MuSE @ dc4d332		MuSE @ dc4d332
concatenate		concatenate
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.md		LICENSE.md
README.md		README.md
align_check.sh		align_check.sh
align_check_template.sh		align_check_template.sh
align_template.sh		align_template.sh
gatk3_env.yml		gatk3_env.yml
main_env.yml		main_env.yml
muse_template.sh		muse_template.sh
mutect_template.sh		mutect_template.sh
pon_template.sh		pon_template.sh
refine_check.sh		refine_check.sh
refine_check_template.sh		refine_check_template.sh
refine_template.sh		refine_template.sh
run_evc.sh		run_evc.sh
run_evc_bam.sh		run_evc_bam.sh
run_evc_precancer.sh		run_evc_precancer.sh
setup_MuSE.sh		setup_MuSE.sh
setup_env.sh		setup_env.sh
start_align.sh		start_align.sh
strelka_env.yml		strelka_env.yml
strelka_template.sh		strelka_template.sh
targetInterval_template.sh		targetInterval_template.sh
varscan_template.sh		varscan_template.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EnsembleVariantCallingPipeline (v 1.0)

Pipeline structure

First time setup

Running the pipeline from fastq

Running the pipeline starting with a bam file

About

Releases

Packages

Contributors 4

Languages

License

AlexandrovLab/EnsembleVariantCallingPipeline

Folders and files

Latest commit

History

Repository files navigation

EnsembleVariantCallingPipeline (v 1.0)

Pipeline structure

First time setup

Running the pipeline from fastq

Running the pipeline starting with a bam file

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages