Under Development
Pipeline for analysis of isolate genomes.
- Preprocessing
- Isolate Genome Assembly
- Gene calling and annotation
- Mapping back to assembly or to reference genome
- Variant calling and annotation
- ANI calculation
- Phylogenetics with PhyloPhlAn
The software depends on conda, Python >= 3.8, and snakemake == 5.22.
git clone https://github.com/MicrobiologyETHZ/NCCR_genomicsPipeline
cd NCCR_genomicsPipeline
conda env create -f nccrPipe_environment.yaml
# mamba env create -f nccrPipe_environment.yaml
conda activate nccrPipe
pip install -e .
- Using pre-created environments:
- Recipes can be found in ``
- Create the environments before running the pipeline [Write instructions]
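Once the environment is active, a quick sanity check of the installation (version numbers should match the dependencies listed above):
conda --version
python --version      # >= 3.8
snakemake --version   # 5.22
nccrPipe isolate --help   # prints the usage message below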
Usage: nccrPipe isolate [OPTIONS]
Options:
-c, --config TEXT Configuration File
-m, --method TEXT Workflow to run, options: [call_variants, assemble,
assemble_only]
--local Run on local machine
--no-conda Do not use conda, under construction, do not use
--dry Show commands without running them
--help Show this message and exit.
Run nccrPipe isolate -m call_variants --dry. This will do a dry run of the variant-calling pipeline on the test dataset.
Run nccrPipe isolate -m call_variants. This will run the variant-calling pipeline on the test dataset.
Use --local if you don't want the jobs to be submitted to the queuing system.
The first time you run, it will take a while to create the conda environments.
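For example, to run variant calling on the test dataset entirely on the local machine (no cluster submission; flags as documented above):
nccrPipe isolate -m call_variants --local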
To run the pipeline you need to provide a YAML config file. An example config file sufficient for variant calling is shown below.
projectName: Test_Variant_Calling
# dataDir: directory with raw sequencing files, files for each sample should be stored in a subdirectory
dataDir: test_data/varcall_test_data/raw
# outDir: output directory
outDir: test_data/varcall_test_data/output
# sampleFile: text file listing samples to be analysed
sampleFile: test_data/varcall_test_data/test_samples.txt
# forward and reverse fastq suffixes
fq_fwd: _R1.fq.gz
fq_rvr: _R2.fq.gz
# Align to reference
# Reference genome to call variants against
reference: test_data/varcall_test_data/LL6_1.fasta.gz
# The rest of the parameters don't have to be changed, copy and paste into your config file
# Preprocessing
qc: yes # yes to perform preprocessing, no to skip
# BBMap parameters
mink: 11
trimq: 14
mapq: 20
minlen: 45
merged: false # By default, merged reads are not used for isolate genome assembly
fastqc: no # Run fastqc or not. Options: no, before, after, both.
# Standard parameters.
adapters: ../../../data/adapters/adapters.fa
phix: ../../../data/adapters/phix174_ill.ref.fa.gz
Example structure for the dataDir:
.
└── dataDir/
├── Sample1/
│ ├── Sample1_ABCD_R1.fq.gz
│ └── Sample1_EFGH_R2.fq.gz
└── Sample2/
├── Sample2_ABCD_R1.fq.gz
└── Sample2_EFGH_R2.fq.gz
In which case, sampleFile would look like this:
Sample1
Sample2
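Each sample name in sampleFile should have a matching subdirectory under dataDir containing reads that end in fq_fwd and fq_rvr. A quick shell check of this layout (not part of the pipeline; paths taken from the example config above):
while read -r sample; do
  ls test_data/varcall_test_data/raw/"$sample"/*_R1.fq.gz \
     test_data/varcall_test_data/raw/"$sample"/*_R2.fq.gz > /dev/null \
    || echo "Missing reads for $sample"
done < test_data/varcall_test_data/test_samples.txt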
nccrPipe isolate -c <full/path/to/your_config.yaml> -m call_variants
nccrPipe isolate -c <full/path/to/your_config.yaml> -m assemble
Log files (stdout and stderr) for each step of the pipeline can be found in outDir/logs/.
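For the example config above, you can inspect them with (individual log file names depend on the rule):
ls test_data/varcall_test_data/output/logs/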
- To run the STAR/featureCounts pipeline run:
nccrPipe rnaseq -c <configfile> -m star
To see which jobs are going to be submitted to the cluster, add the --dry flag. To run the pipeline without submitting jobs to the cluster, add the --local flag.
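For example, to preview the STAR/featureCounts jobs without submitting or running anything:
nccrPipe rnaseq -c <configfile> -m star --dry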
- To run kallisto pipeline run:
nccrPipe rnaseq -c <configfile> -m kallisto
UNDER CONSTRUCTION
- YAML file with the following mandatory fields:
ProjectName: Test
dataDir: path to data directory (see data structure below)
outDir: path to output directory
sampleFile: file with sample names (see example below)
fq_fwd: _1.fq.gz # forward reads fastq suffix
fq_rvr: _2.fq.gz # reverse reads fastq suffix
#Preprocessing
qc: yes
mink: 11
trimq: 14 # 37
mapq: 20
minlen: 45
merged: false # By default, merged reads are not used for isolate genome assembly
fastqc: no # Options: no, before, after, both
# STAR
refGenome: reference_genome.fna # Uncompressed (Did not test with compressed file)
refAnn: reference_annotation.gtf # Important: provide a .gtf, not a .gff
genomeDir: directory to output genome index
overhang: 149 # ReadLength - 1
maxIntron: 50000 # Max size of intron (Depends on organism)
# featureCounts
strand: 0 # Strandedness of the RNASeq library, can be 0, 1 or 2
attribute: gene_id # Should work if using a .gtf file
# kallisto
transcriptome: transcriptome.fa #Uncompressed
kallistoIdx: transcriptome_index_file.idx
# Standard parameters. Don't change these!
adapters: '/nfs/cds-shini.ethz.ch/exports/biol_micro_cds_gr_sunagawa/SequenceStorage/resources/adapters.fa'
phix: '/nfs/cds-shini.ethz.ch/exports/biol_micro_cds_gr_sunagawa/SequenceStorage/resources/phix174_ill.ref.fa.gz'
bbmap_human_ref: '/nfs/cds-shini.ethz.ch/exports/biol_micro_cds_gr_sunagawa/SequenceStorage/resources/human_bbmap_ref/'
bbmap_mouse_ref: '/nfs/cds-shini.ethz.ch/exports/biol_micro_cds_gr_sunagawa/SequenceStorage/resources/mouse_bbmap_ref/'
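For orientation, the STAR, featureCounts, and kallisto fields above map onto the underlying tool options roughly as follows. This is a sketch of typical invocations (file and directory names are illustrative placeholders), not the exact commands the pipeline runs:
# Build the STAR index ('overhang' -> --sjdbOverhang, i.e. read length - 1)
STAR --runMode genomeGenerate --genomeDir genome_index \
  --genomeFastaFiles reference_genome.fna \
  --sjdbGTFfile reference_annotation.gtf --sjdbOverhang 149
# Align reads ('maxIntron' -> --alignIntronMax)
STAR --genomeDir genome_index --readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
  --readFilesCommand zcat --alignIntronMax 50000 --outSAMtype BAM SortedByCoordinate
# Count reads per gene ('strand' -> -s, 'attribute' -> -g)
featureCounts -p -s 0 -g gene_id -a reference_annotation.gtf -o counts.txt sample.bam
# kallisto ('transcriptome' -> index input, 'kallistoIdx' -> index file)
kallisto index -i transcriptome_index_file.idx transcriptome.fa
kallisto quant -i transcriptome_index_file.idx -o kallisto_out sample_R1.fq.gz sample_R2.fq.gz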
|--data/
|  |--samples.txt
|  |--raw/
|  |  |--Sample1/
|  |  |  |--Sample1_abcd_R1.fq.gz
|  |  |  |--Sample1_abcd_R2.fq.gz
|  |  |
|  |  |--Sample2/
|  |  |  |--Sample2_efgh_R1.fq.gz
|  |  |  |--Sample2_efgh_R2.fq.gz
|  |
|  |--processed/
In this case, data/samples.txt would contain:
Sample1
Sample2
and the config file would look like this:
ProjectName: Example_Project
dataDir: data/raw
outDir: data/processed
sampleFile: data/samples.txt
fq_fwd: _R1.fq.gz
fq_rvr: _R2.fq.gz
...
- Example RNASeq config is code/package/nccrPipe/configs/rnaseq_config.yaml
- Example samples file is code/package/nccrPipe/configs/rnaseq_config.yaml
- RNAseq snakemake rules are in code/package/nccrPipe/rnaseq_rules
- preprocess
- merge_fastq
- assemble
- quastCheck
- plasmid
- annotate
- nucmer
- runANI
- phylogeny
- findAb
- pileup
- align
- align_with_ref
- call_vars
- call_vars2 -> use this one
- call_vars3
- markdup
- type
- serotype
- anVar
- anVar2
- pangenome
All of these need to be tested and documented
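These appear to be invoked as -a actions of the top-level nccrPipe command (as with addGenomeSnpEff and anVar below), although this may not hold for every rule; for example:
nccrPipe -c <config_file> -a call_vars2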
Default data structure:
By default, raw data is expected in data/raw and the output directory is data/processed.
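To set up this layout from scratch (directory names follow the defaults above and the config path used in the next command; adjust as needed):
mkdir -p project/data/raw project/data/processed project/code/configs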
cd project
nccrPipe -a create_config -c code/configs/project_config.yaml
- Align reads to reference genome with BWA.
- Remove duplicates with GATK MarkDuplicates
- Run bcftools mpileup + bcftools call, then filter the calls with bcftools filter using:
-g5: filter SNPs within 5 base pairs of an indel or other variant type
-G10: filter clusters of indels separated by 10 or fewer base pairs, allowing only one to pass
-e: exclude calls matching any of the following:
QUAL<10: calls with quality score < 10
DP4[2]<10 || DP4[3]<10: calls with < 10 reads (forward and reverse) covering the variant
(DP4[2] + DP4[3])/sum(DP4) < 0.9: calls with allele frequency < 90%
MQ<50: calls with average mapping quality < 50
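Put together, the calling and filtering step looks roughly like this (a sketch only; file names are placeholders and the exact command used by the pipeline may differ):
bcftools mpileup -f reference.fasta sample.markdup.bam \
  | bcftools call -mv -Ou \
  | bcftools filter -g5 -G10 \
      -e 'QUAL<10 || DP4[2]<10 || DP4[3]<10 || (DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3])<0.9 || MQ<50' \
      -Oz -o sample.filtered.vcf.gz -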
- Add filter based on coverage? Regions with really high coverage generally contain lots of artifacts.
- Still a little complicated: adding a new genome to the snpEff database is not integrated into the main pipeline.
- Conda environments created by the pipeline can be found in .snakemake/conda/
- If you want to add a new genome, run:
nccrPipe -c <config_file> -a addGenomeSnpEff
- The config file has to include the path to the gbk (and ideally fasta) files, as well as the name of the genome. The chromosome names in the two files have to match.
- If that doesn't fail, you can run:
nccrPipe -c <config_file> -a anVar