Skip to content

The EnsembleVariantCallingPipeline takes files in FASTQ or BAM format and performs SNV and INDEL variant calling from 4 variant callers (Mutect2, Strelka2, Varscan2, MuSE) and indel variant calling from 3 variant callers (Mutect2, Strelka2, Varscan2).

License

Notifications You must be signed in to change notification settings

AlexandrovLab/EnsembleVariantCallingPipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

License

Note: This pipeline was designed to run on the Triton Shared Compute Cluster (TSCC) at the University of California - San Diego. Running it elsewhere will require modifying the source code.

EnsembleVariantCallingPipeline (v 1.0)

The EnsembleVariantCallingPipeline takes files in FASTQ or BAM format and performs SNV and INDEL variant calling from 4 variant callers (Mutect2, Strelka2, Varscan2, MuSE) and indel variant calling from 3 variant callers (Mutect2, Strelka2, Varscan2).

Pipeline structure

First time setup

  1. Clone the directory
git clone https://github.com/AlexandrovLab/EnsembleVariantCallingPipeline.git
cd EnsembleVariantCallingPipeline/
  1. Run the setup_env.sh script.
bash setup_env.sh

Running the pipeline from fastq

DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.

  1. Generate project directory structure. (takes <2 minutes)

    Run run_evc to create the jobs directory in the output directory. The pipeline will generate .pbs files for you to manually submit at each stage.

    run_evc \
    	path/to/fastq/files \
    	output/directory \
    	path/to/sample.map \
    	email.for@notification \
    	reference_genome (fasta) \
    	pon (INTERNAL_PON or provide a PON for Mutect) \
    	gnomad_dbSNP \
    	known_indel_list \
    	base_recalibration_list \
    	max_walltime (hours only) \
    	queue (hotel or home) \
    	interval_list_for_mutect \
    	Optional: run refine? (yes or no) [default: no]
    

    An exemplary suggested inputs:

    run_evc \
    	path/to/fastq/files \
    	output/directory \
    	path/to/sample.map \
    	email.for@notification \
    	/restricted/alexandrov-group/shared/Reference_Genomes/GRCh38.d1.vd1/GRCh38.d1.vd1.fa \
    	/projects/ps-lalexandrov/shared/Reference_Genomes/TCGA_GDC_files/tumor_normal_PONs/MuTect2.PON.5210.vcf.gz \
    	/restricted/alexandrov-group/shared/Reference_Genomes/dbSNP/af-only-gnomad.hg38_no_alt.vcf.gz \
    	/restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/IndelRealignemnt_files.txt \
    	/restricted/alexandrov-group/shared/Reference_Genomes/alignment_refinement/BaseRecalibration_files.txt \
    	500 \
    	home-alexandrov \
    	interval_list_for_mutect \
    	no
    
  2. Alignment (takes 10 minutes to several hours depending on job queue traffic and the size of your fastq files)

    Performs alignment with bwa-mem and marks duplicates caused by PCR with Picard's (GATK's) MarkDuplicates.

    1. Navigate to the jobs/align directory in the output directory specified earlier. This directory contains the .pbs scripts used to run alignment on your fastq samples on TSCC.
    2. Submit all alignment jobs to the job queue.
      for pbs_file in *.pbs; do qsub ${pbs_file}; done
      

      This script will generate directories and perform alignment for each of your samples directly under output_dir. Within each of these, the script will generate tumor and normal _raw.bam files and then tumor and normal _mkdp.bam files.

    3. Check that your aligned `.bam` files were generated properly. A quick method of checking the integrity of every bam file quickly is by running `samtools quickcheck` on the `.bam` files (*Note: empty output means there were no errors*). Here is a script to quickly check all of your files:
      output_dir=[your output directory]
      for f in ${output_dir}/*/*_mkdp.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
      
  3. (Optional) Refine alignment (takes several hours)

    You might want to skip this step if you believe your fastq files are of high quality since this step takes a long time to run. If you originally ran run_evc with refinement disabled, you can still run refinement by rerunning run_evc with the refinement option enabled.

    1. Navigate to the jobs/refine directory.
    2. Submit all _targetInterval.pbs files to job queue.
      for pbs_file in *_targetInterval.pbs; do qsub ${pbs_file}; done
      

      This script will subsequently submit the _Trefine.pbs and _Nrefine.pbs files.

    3. Once finished, you can check the refined bam files, which are named _final.bam, using the same program as after alignment samtools quickcheck.
      output_dir=[your output directory]
      for f in ${output_dir}/*/*_final.bam; do echo Checking ${f}: ; samtools quickcheck $f; done
      
  4. Generating the panel of normals (takes about 1 hour) - WIP. Does not work yet. Please pass in your own panel of normals or use an empty file as a placeholder.

    The panel of normals (PON) is useful in filtering out sequencing errors caused by limitations specific to the sequencing machine used. By comparing all normal samples in a batch that used the same sequening machine, this can identify recurrent mutations caused by technical artifacts from the machine. It can either be passed in through run_evc or generated from the bam files. The file is required to run Mutect2.

    1. Navigate to the jobs/pon directory.
    2. Submit all _pon.pbs jobs to the job queue.
      for pbs_file in *_pon.pbs; do qsub ${pbs_file}; done
      
    3. Combine all individual pons into a merged pon
      TODO: add code
      
  5. Run variant calling (takes several hours)

    The order of running the variant callers does not matter since they work independently. Filtering the .vcf files after variant calling is highly recommended.

    • Mutect2

      Submit the _mutect.pbs files under jobs/mutect.

      cd jobs/mutect
      for pbs_file in *_mutect.pbs; do qsub ${pbs_file}; done
      
    • Strelka2

      Submit the _strelka.pbs files under jobs/strelka.

      cd jobs/strelka
      for pbs_file in *_strelka.pbs; do qsub ${pbs_file}; done
      
    • Varscan2

      Submit the _varscan.pbs files under jobs/varscan.

      cd jobs/varscan
      for pbs_file in *_varscan.pbs; do qsub ${pbs_file}; done
      
    • MuSE

      Submit the _muse.pbs files under jobs/muse.

      cd jobs/muse
      for pbs_file in *_muse.pbs; do qsub ${pbs_file}; done
      

Running the pipeline starting with a bam file

DO NOT move on if the previous step is still running. Each step must finish completely before moving on to the next step.

  1. Generate project directory structure. (takes <2 minutes)

    Run run_evc_bam to create the jobs directory in the output directory. The pipeline will generate .pbs files for you to manually submit at each stage.

    run_evc_bam \
    	path/to/bam/files \
    	output/directory \
    	path/to/sample.map \
    	email.for@notification \
    	reference_genome (fasta) \
    	pon (INTERNAL_PON) \
    	gnomad_dbSNP \
    	max_walltime (hours only) \
    	queue (hotel or home) \
    	interval_list_for_mutect
    
  2. Follow the "Running from fastq" pipeline starting at step 5: Run variant calling

One shall submit the jobs at each stage step by step: (DO NOT move to the next step until the previous step is finished and verified.)

About

The EnsembleVariantCallingPipeline takes files in FASTQ or BAM format and performs SNV and INDEL variant calling from 4 variant callers (Mutect2, Strelka2, Varscan2, MuSE) and indel variant calling from 3 variant callers (Mutect2, Strelka2, Varscan2).

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages