v0.4.0 release of bettercallsal.
biocoder committed Mar 17, 2023
1 parent b45fc0b commit aa8c024
Showing 20 changed files with 716 additions and 176 deletions.
120 changes: 11 additions & 109 deletions README.md
@@ -1,6 +1,6 @@
# `bettercallsal`

`bettercallsal` is an automated workflow to assign Salmonella serotypes based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Salmonella](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Salmonella%20enterica%22). It uses `MASH` to reduce the search space for genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow can be used to analyze shotgun metagenomics datasets, quasi-metagenomic datasets (enriched for Salmonella) and target-enriched datasets (enriched with molecular baits specific for Salmonella), and is especially useful when a sample contains a mixture of multiple serovars.
`bettercallsal` is an automated workflow to assign Salmonella serotypes based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Salmonella](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Salmonella%20enterica%22). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow can be used to analyze shotgun metagenomics datasets, quasi-metagenomic datasets (enriched for Salmonella) and target-enriched datasets (enriched with molecular baits specific for Salmonella), and is especially useful when a sample contains a mixture of multiple serovars.

It is written in **Nextflow** and is part of the modular data analysis pipelines (**CFSAN PIPELINES** or **CPIPES** for short) at **CFSAN**.

@@ -30,7 +30,13 @@ We gratefully acknowledge all data contributors, i.e., the Authors and their Ori
### Citing `bettercallsal`

---
This work is currently unpublished. If you are making use of this analysis pipeline, we would appreciate it if you credit this repository.
This work is currently unpublished. If you are making use of this analysis pipeline, we would appreciate it if you credit this repository while citing us (tentative):

>
>**bettercallsal: Towards precise detection of Salmonella serotypes from enrichment cultures using shotgun metagenomic profiling and its application in an outbreak setting**
>
>Kranti Konganti, Elizabeth Reed, Mark Mammel, Tunc Kayikcioglu, Rachel Binet, Karen Jarvis, Christina M. Ferreira, Rebecca Bell, Jie Zheng, Amanda M. Windsor, Andrea Ottesen, Christopher Grim, and Padmini Ramachandran. *<https://github.com/CFSAN-Biostatistics/bettercallsal>*
>
\
&nbsp;
@@ -39,10 +45,10 @@ This work is currently unpublished. If you are making use of this analysis pipel

---

- The main workflow has not yet been fully validated and must be utilized for **research purposes** only.
- The main workflow has been used for **research purposes** only.
- Analysis results should be interpreted with caution and should be treated as suspect, as the pipeline is dependent on the precision of metadata from the **NCBI Pathogen Detection** project for the `per_snp_cluster` and `per_computed_serotype` databases.
- Detection threshold, i.e., sequencing depth, has not yet been established for the `bettercallsal` analysis workflow, and therefore a **No genome hit** assignment should be interpreted with caution.
- Multiple Salmonella serotype assignments should also be treated with caution, as this pipeline has not been tested on samples containing mixtures of three or more serovars.
- Internal research with simulated datasets suggests that the `bettercallsal` workflow is more accurate with increased read depth, ideally at least 5 million read pairs (PE) or 10 million reads (SE) per sample. That said, this is not a hard cutoff, and you can still try the workflow on low read-depth samples.
- A **No genome hit** assignment should be interpreted with caution.

\
&nbsp;
@@ -51,107 +57,3 @@ This work is currently unpublished. If you are making use of this analysis pipel

---
**CFSAN, FDA** assumes no responsibility whatsoever for use by other parties of the Software, its source code, documentation or compiled or uncompiled executables, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic. Further, **CFSAN, FDA** makes no representations that the use of the Software will not infringe any patent or proprietary rights of third parties. The use of this code in no way implies endorsement by the **CFSAN, FDA** or confers any advantage in regulatory decisions.

\
&nbsp;

### Minimum Requirements

---

1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
- Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure it is available in your `$PATH` (a minimal install sketch follows this list).
- If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. Either of `micromamba` or `docker` or `singularity` installed and made available in your `$PATH`.
- Running the workflow via `micromamba` software provisioning is **preferred**, as it does not require `sudo` or `admin` privileges or any other configuration of the various container providers.
- To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is available in your `$PATH`.
- Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
3. A minimum of 10 CPUs and about 64 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
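
As a rough sketch of the install steps above (the download URLs are the ones linked in this section; the `$HOME/bin` destination is just an assumption about where your `$PATH` points):

```bash
# Download Nextflow 22.10.0 and make it executable.
curl -L -o nextflow https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow
chmod 755 nextflow
mv nextflow "$HOME/bin/"        # assumes $HOME/bin is already on your $PATH

# Download the micromamba binary (the curl step alone is sufficient here);
# this is the Linux x86_64 build per the manual-installation docs linked above.
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
mv bin/micromamba "$HOME/bin/"

# Confirm both are visible on $PATH.
nextflow -version
micromamba --version
```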

\
&nbsp;

### Workflow Usage

---
Clone or download this repository and then call `cpipes`.

The following is an example of how to run the `bettercallsal` pipeline using `conda` for software provisioning. This requires that the `micromamba` executable be available in your `$PATH`.

```bash
cpipes --pipeline bettercallsal --enable-conda -with-conda [options]
```

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
-profile your_institution
```

The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run; for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW**-related logs, reports and trace files.
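
As an illustration (the report directory name follows the `CPIPES-bettercallsal` convention described above; the output path is the placeholder used in the example):

```bash
# From the directory where cpipes was launched:
ls CPIPES-bettercallsal/                  # NEXTFLOW logs, reports and trace files
ls /path/to/where/output/should/go/       # per-step pipeline results
```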

\
&nbsp;

### `your_institution.config`

---

In the above example, the run time profile is given as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](./conf/computeinfra.config) file, which is located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, add these lines:

```groovy
your_institution {
process.executor = 'sge'
process.queue = 'normal.q'
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = false
params.enable_conda = true
conda.enabled = true
conda.useMicromamba = true
params.enable_module = false
}
```

In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether, in which case the `bettercallsal` workflow will request the appropriate memory and number of CPUs automatically, ranging from 1 CPU, 1 GB of memory and 1 hour of wall time up to 10 CPUs, 1 TB of memory and 120 hours of wall time.

\
&nbsp;

### Cloud computing

---

You can theoretically run the workflow in the cloud (not yet tested). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

Example:

```groovy
my_aws_batch {
executor = 'awsbatch'
queue = 'my-batch-queue'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
aws.batch.region = 'us-east-1'
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = true
params.enable_conda = false
params.enable_module = false
}
```

\
&nbsp;

### Output

---

All the outputs for each step are stored inside the folder given with the `--output` option. A `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.
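
For example (a sketch only: the remote host is a placeholder, and the report path is derived from the `--output` value used earlier in this README):

```bash
# Copy the consolidated MultiQC report to your workstation and open it in a browser.
scp user@cluster:/path/to/where/output/should/go/bettercallsal-multiqc/multiqc_report.html .
xdg-open multiqc_report.html   # use 'open' on macOS
```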
6 changes: 3 additions & 3 deletions bin/sourmash_filter_hits.py
@@ -2,13 +2,13 @@

# Kranti Konganti

import os
import argparse
import gzip
import inspect
import logging
import re
import os
import pprint
import gzip
import re

# Set logging.
logging.basicConfig(
5 changes: 5 additions & 0 deletions conf/base.config
@@ -1,3 +1,7 @@
plugins {
id 'nf-amazon'
}

params {
fs = File.separator
cfsanpipename = 'CPIPES'
@@ -15,6 +19,7 @@ params {
tracereportsdir = "${launchDir}${params.fs}${cfsanpipename}-${params.pipeline}${params.fs}nextflow-reports"
dummyfile = "${projectDir}${params.fs}assets${params.fs}dummy_file.txt"
dummyfile2 = "${projectDir}${params.fs}assets${params.fs}dummy_file2.txt"
max_cpus = 10
linewidth = 80
pad = 32
pipeline = null
17 changes: 16 additions & 1 deletion conf/computeinfra.config
@@ -132,4 +132,19 @@ kondagac {
conda.useMicromamba = true
params.enable_module = false
clusterOptions = '-n 1 --signal B:USR2'
}
}

cfsanawsbatch {
process.executor = 'awsbatch'
process.queue = 'cfsan-nf-batch-job-queue'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
aws.batch.region = 'us-east-1'
aws.batch.volumes = ['/hpc/db:/hpc/db:ro', '/hpc/scratch:/hpc/scratch:rw']
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = true
params.enable_conda = false
conda.enabled = false
conda.useMicromamba = false
params.enable_module = false
}
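
A run on AWS Batch with this profile might then be launched roughly as follows (a hypothetical sketch: the S3 URIs are placeholders, and it assumes `cpipes` passes standard **Nextflow** options such as `-work-dir` through, just as it does for `-profile`):

```bash
cpipes \
--pipeline bettercallsal \
--input s3://my-bucket/fastq_pass_dir \
--output s3://my-bucket/bettercallsal-results \
-profile cfsanawsbatch \
-work-dir s3://my-bucket/nf-work    # AWS Batch requires an S3 work directory
```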
1 change: 1 addition & 0 deletions conf/logtheseparams.config
@@ -12,5 +12,6 @@ params {
"${params.fq_filename_delim_idx}" ? 'fq_filename_delim_idx' : null,
'enable_conda',
'enable_module',
'max_cpus'
]
}
22 changes: 14 additions & 8 deletions conf/modules.config
@@ -23,19 +23,19 @@ process {
}

withLabel: 'process_pico' {
cpus = { 2 * task.attempt }
cpus = { min_cpus(2) * task.attempt }
memory = { 4.GB * task.attempt }
time = { 2.h * task.attempt }
}

withLabel: 'process_nano' {
cpus = { 4 * task.attempt }
cpus = { min_cpus(4) * task.attempt }
memory = { 8.GB * task.attempt }
time = { 4.h * task.attempt }
}

withLabel: 'process_micro' {
cpus = { 8 * task.attempt }
cpus = { min_cpus(8) * task.attempt }
memory = { 16.GB * task.attempt }
time = { 8.h * task.attempt }
}
@@ -59,31 +59,31 @@
}

withLabel: 'process_low' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 60.GB * task.attempt }
time = { 20.h * task.attempt }
}

withLabel: 'process_medium' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 100.GB * task.attempt }
time = { 30.h * task.attempt }
}

withLabel: 'process_high' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 128.GB * task.attempt }
time = { 60.h * task.attempt }
}

withLabel: 'process_higher' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 256.GB * task.attempt }
time = { 60.h * task.attempt }
}

withLabel: 'process_gigantic' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 512.GB * task.attempt }
time = { 60.h * task.attempt }
}
@@ -110,3 +110,9 @@ def dynamic_retry(task_retry_num, factor_by) {
sleep(Math.pow(1.27, task_retry_num.toInteger()) as long)
return 'retry'
}

// Function that caps the number of CPU cores requested
// by a process at the user-specified params.max_cpus.
def min_cpus(cores) {
return Math.min(cores as int, "${params.max_cpus}" as int)
}
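
In practice, a user on a smaller machine can lower this ceiling from the command line; a sketch (the `--max_cpus` override relies on **Nextflow**'s standard `--<param>` handling, and the other values are the placeholders used in the README):

```bash
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
--max_cpus 4 \
-profile your_institution
```

With `--max_cpus 4`, every `min_cpus(N)` call above resolves to at most 4, so even `process_gigantic` tasks request no more than 4 cores.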
22 changes: 21 additions & 1 deletion lib/help/sfhpy.nf
@@ -14,13 +14,33 @@ def sfhpyHelp(params) {
cliflag: null,
clivalue: null
],
'sfhpy_fcn': [
clihelp: 'Column name by which filtering of rows should be applied. ' +
"Default: ${params.sfhpy_fcn}",
cliflag: '-fcn',
clivalue: (params.sfhpy_fcn ?: '')
],
'sfhpy_fcv': [
clihelp: 'Remove genomes whose match with the query FASTQ is less than ' +
'this much. ' +
"Default: ${params.sfhpy_fcv}",
cliflag: '-fcv',
clivalue: (params.sfhpy_fcv ?: '')
]
],
'sfhpy_gt': [
clihelp: 'Apply greater than or equal to condition on numeric values of ' +
'--sfhpy_fcn column. ' +
"Default: ${params.sfhpy_gt}",
cliflag: '-gt',
clivalue: (params.sfhpy_gt ? ' ' : '')
],
'sfhpy_lt': [
clihelp: 'Apply less than or equal to condition on numeric values of ' +
'--sfhpy_fcn column. ' +
"Default: ${params.sfhpy_lt}",
cliflag: '-lt',
clivalue: (params.sfhpy_lt ? ' ' : '')
],
]

toolspecs.each {
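
A hypothetical invocation using the new filtering options (the `0.2` threshold is an assumption made purely for illustration, as is the idea of passing these parameters directly on the `cpipes` command line; the generated `--help` output is the authoritative reference):

```bash
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
--sfhpy_fcv 0.2 \
--sfhpy_gt true \
-profile your_institution
```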
