This repo contains workflows for computational pathogen discovery using PathSeq, a pipeline in the Genome Analysis Toolkit (GATK) for detecting microbial organisms in short-read deep sequencing samples taken from a host organism.
Additional Resources:
- How to Run the Pathseq pipeline (manually)
- GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts
Runs the PathSeq pipeline
- BAM
  - File must pass validation by ValidateSamFile
  - All reads must have an RG tag
  - One or more read groups, all belonging to a single sample (SM)
- Host and microbe reference files, available in the GATK Resource Bundle
- BAM file containing microbe-mapped reads and reads of unknown sequence
- Tab-separated value (.tsv) file of taxonomic abundance scores
- Picard-style metrics files for the filter and scoring phases of the pipeline
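
The read-group requirements on the input BAM (all read groups belong to a single sample) can be partly checked from the file header before a run. The following is a minimal Python sketch, not part of this repo. It assumes the SAM header text has already been extracted (for example with `samtools view -H`), and the function names `read_group_samples` and `has_single_sample` are hypothetical. It inspects only the `@RG` header lines, so it does not replace ValidateSamFile and does not confirm that every read carries an RG tag.

```python
# Hypothetical helper names; only the SAM header convention (@RG lines with
# tab-separated TAG:VALUE fields such as ID and SM) is taken from the format.

EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:lane1\tSM:patient01\tPL:ILLUMINA\n"
    "@RG\tID:lane2\tSM:patient01\tPL:ILLUMINA\n"
)


def read_group_samples(header_text):
    """Map each read group ID to its SM value (None if SM is absent)."""
    groups = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue  # skip @HD, @SQ, @PG and any other header records
        tags = {}
        for field in line.split("\t")[1:]:
            key, sep, value = field.partition(":")
            if sep:
                tags[key] = value
        if "ID" in tags:
            groups[tags["ID"]] = tags.get("SM")
    return groups


def has_single_sample(header_text):
    """True if there is at least one read group and all share one SM value."""
    samples = set(read_group_samples(header_text).values())
    return len(samples) == 1 and None not in samples


if __name__ == "__main__":
    print(read_group_samples(EXAMPLE_HEADER))
    print(has_single_sample(EXAMPLE_HEADER))
```

A header with no `@RG` lines, with a read group lacking `SM`, or with more than one distinct `SM` value makes `has_single_sample` return `False`.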
Builds a microbe reference for use with PathSeq
- FASTA file containing microbe sequences from NCBI RefSeq
- FASTA index and dictionary files
- GATK BWA-MEM index image
- PathSeq taxonomy file
Builds a host reference for use with PathSeq
- FASTA file containing host sequences
- FASTA index and dictionary files
- GATK BWA-MEM index image
- PathSeq Kmer file
- GATK 4 or later
- Cromwell version support
  - Successfully tested on v36
  - Does not work on versions < v23 due to output syntax
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- The provided JSON is a ready-to-use example template for the workflow. Users are responsible for reviewing the GATK tool and tutorial documentation to properly set the reference and resource variables.
- For help running workflows on the Google Cloud Platform or locally, please see the following tutorial: (How to) Execute Workflows from the gatk-workflows Git Organization.
- Please visit the User Guide site for further documentation on our workflows and tools.
- Relevant reference and resource bundles can be accessed in the Resource Bundle.
- The following material is provided by the Data Science Platform group at the Broad Institute. Please direct any questions or concerns to one of our forum sites: GATK or Terra.
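
The Cromwell version constraints noted above (tested on v36, not working below v23) can be expressed as a small pre-flight check. This is a hedged Python sketch rather than anything shipped with the workflows: the function names and the three status labels are hypothetical, and versions from v23 up to v35 or above v36 are reported as "untested" because this document makes no claim about them.

```python
import re

MIN_WORKING_MAJOR = 23  # versions below this fail due to output syntax
TESTED_MAJOR = 36       # the version this document reports as tested


def parse_major(version):
    """Extract the major version number from strings like 'v36', '36' or '36.1'."""
    match = re.fullmatch(r"v?(\d+)(?:[.\-].*)?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Cromwell version: {version!r}")
    return int(match.group(1))


def cromwell_status(version):
    """Classify a Cromwell version as 'unsupported', 'tested' or 'untested'."""
    major = parse_major(version)
    if major < MIN_WORKING_MAJOR:
        return "unsupported"
    if major == TESTED_MAJOR:
        return "tested"
    return "untested"


if __name__ == "__main__":
    for candidate in ("v22", "v23", "v36", "40"):
        print(candidate, cromwell_status(candidate))
```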
This script is released under the WDL source code license (BSD-3) (see LICENSE in https://github.com/broadinstitute/wdl). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.