v0.4.0 release of bettercallsal.
biocoder committed Mar 17, 2023
1 parent b45fc0b commit aa8c024
Showing 20 changed files with 716 additions and 176 deletions.
120 changes: 11 additions & 109 deletions README.md
@@ -1,6 +1,6 @@
# `bettercallsal`

`bettercallsal` is an automated workflow to assign Salmonella serotypes based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Salmonella](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Salmonella%20enterica%22). It uses `MASH` to reduce the search space for genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow can be used to analyze shotgun metagenomics datasets, quasi-metagenomic datasets (enriched for Salmonella) and target-enriched datasets (enriched with molecular baits specific for Salmonella), and is especially useful when a sample contains a mixture of multiple serovars.
`bettercallsal` is an automated workflow to assign Salmonella serotypes based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Salmonella](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Salmonella%20enterica%22). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow can be used to analyze shotgun metagenomics datasets, quasi-metagenomic datasets (enriched for Salmonella) and target-enriched datasets (enriched with molecular baits specific for Salmonella), and is especially useful when a sample contains a mixture of multiple serovars.

It is written in **Nextflow** and is part of the modular data analysis pipelines (**CFSAN PIPELINES** or **CPIPES** for short) at **CFSAN**.

@@ -30,7 +30,13 @@ We gratefully acknowledge all data contributors, i.e., the Authors and their Ori
### Citing `bettercallsal`

---
This work is currently unpublished. If you are making use of this analysis pipeline, we would appreciate it if you credit this repository.
This work is currently unpublished. If you are making use of this analysis pipeline, we would appreciate it if you credit this repository while citing us (tentative):

>
>**bettercallsal: Towards precise detection of Salmonella serotypes from enrichment cultures using shotgun metagenomic profiling and its application in an outbreak setting**
>
>Kranti Konganti, Elizabeth Reed, Mark Mammel, Tunc Kayikcioglu, Rachel Binet, Karen Jarvis, Christina M. Ferreira, Rebecca Bell, Jie Zheng, Amanda M. Windsor, Andrea Ottesen, Christopher Grim, and Padmini Ramachandran. *<https://github.com/CFSAN-Biostatistics/bettercallsal>*
>
\
&nbsp;
@@ -39,10 +45,10 @@ This work is currently unpublished. If you are making use of this analysis pipel

---

- The main workflow has not yet been fully validated and must be utilized for **research purposes** only.
- The main workflow has been used for **research purposes** only.
- Analysis results should be interpreted with caution and should be treated as suspect, as the pipeline is dependent on the precision of metadata from the **NCBI Pathogen Detection** project for the `per_snp_cluster` and `per_computed_serotype` databases.
- Detection threshold, i.e., sequencing depth, has not yet been established for the `bettercallsal` analysis workflow, and therefore a **No genome hit** assignment should be interpreted with caution.
- Multiple Salmonella serotype assignments should also be treated with caution, as this pipeline has not been tested on samples containing mixtures of three or more serovars.
- Internal research with simulated datasets suggests that the `bettercallsal` workflow is more accurate with increased read depth, ideally at least 5 million read pairs (PE) or 10 million reads (SE) per sample. That said, this is not a hard cutoff, and you can still try the workflow on low read-depth samples.
- A **No genome hit** assignment should be interpreted with caution.

\
&nbsp;
@@ -51,107 +57,3 @@ This work is currently unpublished. If you are making use of this analysis pipel

---
**CFSAN, FDA** assumes no responsibility whatsoever for use by other parties of the Software, its source code, documentation or compiled or uncompiled executables, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic. Further, **CFSAN, FDA** makes no representations that the use of the Software will not infringe any patent or proprietary rights of third parties. The use of this code in no way implies endorsement by the **CFSAN, FDA** or confers any advantage in regulatory decisions.

\
&nbsp;

### Minimum Requirements

---

1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
- Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure it is available in your `$PATH` (a minimal install sketch follows this list).
- If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. Either of `micromamba` or `docker` or `singularity` installed and made available in your `$PATH`.
- Running the workflow via `micromamba` software provisioning is **preferred**, as it does not require `sudo` or `admin` privileges or any other configuration of the various container providers.
- To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is available in your `$PATH`.
- Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
3. A minimum of 10 CPUs and about 64 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
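
As a rough sketch of the install steps above (the download URLs are the ones linked in this section; the `$HOME/bin` destination is just an assumption about where your `$PATH` points):

```bash
# Download Nextflow 22.10.0 and make it executable.
curl -L -o nextflow https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow
chmod 755 nextflow
mv nextflow "$HOME/bin/"        # assumes $HOME/bin is already on your $PATH

# Download the micromamba binary (the curl step alone is sufficient here);
# this is the Linux x86_64 build per the manual-installation docs linked above.
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
mv bin/micromamba "$HOME/bin/"

# Confirm both are visible on $PATH.
nextflow -version
micromamba --version
```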

\
&nbsp;

### Workflow Usage

---
Clone or download this repository and then call `cpipes`.

The following is an example of how to run the `bettercallsal` pipeline using `conda` for software provisioning. This requires that the `micromamba` executable be available in your `$PATH`.

```bash
cpipes --pipeline bettercallsal --enable-conda -with-conda [options]
```

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
-profile your_institution
```

The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run; for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW**-related logs, reports and trace files.
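
As an illustration (the report directory name follows the `CPIPES-bettercallsal` convention described above; the output path is the placeholder used in the example):

```bash
# From the directory where cpipes was launched:
ls CPIPES-bettercallsal/                  # NEXTFLOW logs, reports and trace files
ls /path/to/where/output/should/go/       # per-step pipeline results
```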

\
&nbsp;

### `your_institution.config`

---

In the above example, the run time profile is given as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](./conf/computeinfra.config) file, which is located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, add these lines:

```groovy
your_institution {
process.executor = 'sge'
process.queue = 'normal.q'
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = false
params.enable_conda = true
conda.enabled = true
conda.useMicromamba = true
params.enable_module = false
}
```

In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether, in which case the `bettercallsal` workflow will request the appropriate memory and number of CPUs automatically, ranging from 1 CPU, 1 GB of memory and 1 hour of wall time up to 10 CPUs, 1 TB of memory and 120 hours of wall time.

\
&nbsp;

### Cloud computing

---

You can theoretically run the workflow in the cloud (not yet tested). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

Example:

```groovy
my_aws_batch {
executor = 'awsbatch'
queue = 'my-batch-queue'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
aws.batch.region = 'us-east-1'
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = true
params.enable_conda = false
params.enable_module = false
}
```

\
&nbsp;

### Output

---

All the outputs for each step are stored inside the folder given with the `--output` option. A `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.
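
For example (a sketch only: the remote host is a placeholder, and the report path is derived from the `--output` value used earlier in this README):

```bash
# Copy the consolidated MultiQC report to your workstation and open it in a browser.
scp user@cluster:/path/to/where/output/should/go/bettercallsal-multiqc/multiqc_report.html .
xdg-open multiqc_report.html   # use 'open' on macOS
```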
6 changes: 3 additions & 3 deletions bin/sourmash_filter_hits.py
@@ -2,13 +2,13 @@

# Kranti Konganti

import os
import argparse
import gzip
import inspect
import logging
import re
import os
import pprint
import gzip
import re

# Set logging.
logging.basicConfig(
5 changes: 5 additions & 0 deletions conf/base.config
@@ -1,3 +1,7 @@
plugins {
id 'nf-amazon'
}

params {
fs = File.separator
cfsanpipename = 'CPIPES'
@@ -15,6 +19,7 @@ params {
tracereportsdir = "${launchDir}${params.fs}${cfsanpipename}-${params.pipeline}${params.fs}nextflow-reports"
dummyfile = "${projectDir}${params.fs}assets${params.fs}dummy_file.txt"
dummyfile2 = "${projectDir}${params.fs}assets${params.fs}dummy_file2.txt"
max_cpus = 10
linewidth = 80
pad = 32
pipeline = null
17 changes: 16 additions & 1 deletion conf/computeinfra.config
@@ -132,4 +132,19 @@ kondagac {
conda.useMicromamba = true
params.enable_module = false
clusterOptions = '-n 1 --signal B:USR2'
}
}

cfsanawsbatch {
process.executor = 'awsbatch'
process.queue = 'cfsan-nf-batch-job-queue'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
aws.batch.region = 'us-east-1'
aws.batch.volumes = ['/hpc/db:/hpc/db:ro', '/hpc/scratch:/hpc/scratch:rw']
singularity.enabled = false
singularity.autoMounts = true
docker.enabled = true
params.enable_conda = false
conda.enabled = false
conda.useMicromamba = false
params.enable_module = false
}
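
A run on AWS Batch with this profile might then be launched roughly as follows (a hypothetical sketch: the S3 URIs are placeholders, and it assumes `cpipes` passes standard **Nextflow** options such as `-work-dir` through, just as it does for `-profile`):

```bash
cpipes \
--pipeline bettercallsal \
--input s3://my-bucket/fastq_pass_dir \
--output s3://my-bucket/bettercallsal-results \
-profile cfsanawsbatch \
-work-dir s3://my-bucket/nf-work    # AWS Batch requires an S3 work directory
```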
1 change: 1 addition & 0 deletions conf/logtheseparams.config
@@ -12,5 +12,6 @@ params {
"${params.fq_filename_delim_idx}" ? 'fq_filename_delim_idx' : null,
'enable_conda',
'enable_module',
'max_cpus'
]
}
22 changes: 14 additions & 8 deletions conf/modules.config
@@ -23,19 +23,19 @@ process {
}

withLabel: 'process_pico' {
cpus = { 2 * task.attempt }
cpus = { min_cpus(2) * task.attempt }
memory = { 4.GB * task.attempt }
time = { 2.h * task.attempt }
}

withLabel: 'process_nano' {
cpus = { 4 * task.attempt }
cpus = { min_cpus(4) * task.attempt }
memory = { 8.GB * task.attempt }
time = { 4.h * task.attempt }
}

withLabel: 'process_micro' {
cpus = { 8 * task.attempt }
cpus = { min_cpus(8) * task.attempt }
memory = { 16.GB * task.attempt }
time = { 8.h * task.attempt }
}
@@ -59,31 +59,31 @@
}

withLabel: 'process_low' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 60.GB * task.attempt }
time = { 20.h * task.attempt }
}

withLabel: 'process_medium' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 100.GB * task.attempt }
time = { 30.h * task.attempt }
}

withLabel: 'process_high' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 128.GB * task.attempt }
time = { 60.h * task.attempt }
}

withLabel: 'process_higher' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 256.GB * task.attempt }
time = { 60.h * task.attempt }
}

withLabel: 'process_gigantic' {
cpus = { 10 * task.attempt }
cpus = { min_cpus(10) * task.attempt }
memory = { 512.GB * task.attempt }
time = { 60.h * task.attempt }
}
@@ -110,3 +110,9 @@ def dynamic_retry(task_retry_num, factor_by) {
sleep(Math.pow(1.27, task_retry_num.toInteger()) as long)
return 'retry'
}

// Function that caps the number of CPU cores requested
// by a process at the user-specified params.max_cpus.
def min_cpus(cores) {
return Math.min(cores as int, "${params.max_cpus}" as int)
}
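
In practice, a user on a smaller machine can lower this ceiling from the command line; a sketch (the `--max_cpus` override relies on **Nextflow**'s standard `--<param>` handling, and the other values are the placeholders used in the README):

```bash
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
--max_cpus 4 \
-profile your_institution
```

With `--max_cpus 4`, every `min_cpus(N)` call above resolves to at most 4, so even `process_gigantic` tasks request no more than 4 cores.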
22 changes: 21 additions & 1 deletion lib/help/sfhpy.nf
@@ -14,13 +14,33 @@ def sfhpyHelp(params) {
cliflag: null,
clivalue: null
],
'sfhpy_fcn': [
clihelp: 'Column name by which filtering of rows should be applied. ' +
"Default: ${params.sfhpy_fcn}",
cliflag: '-fcn',
clivalue: (params.sfhpy_fcn ?: '')
],
'sfhpy_fcv': [
clihelp: 'Remove genomes whose match with the query FASTQ is less than ' +
'this much. ' +
"Default: ${params.sfhpy_fcv}",
cliflag: '-fcv',
clivalue: (params.sfhpy_fcv ?: '')
]
],
'sfhpy_gt': [
clihelp: 'Apply greater than or equal to condition on numeric values of ' +
'--sfhpy_fcn column. ' +
"Default: ${params.sfhpy_gt}",
cliflag: '-gt',
clivalue: (params.sfhpy_gt ? ' ' : '')
],
'sfhpy_lt': [
clihelp: 'Apply less than or equal to condition on numeric values of ' +
'--sfhpy_fcn column. ' +
"Default: ${params.sfhpy_lt}",
cliflag: '-lt',
clivalue: (params.sfhpy_lt ? ' ' : '')
],
]

toolspecs.each {
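
A hypothetical invocation using the new filtering options (the `0.2` threshold is an assumption made purely for illustration, as is the idea of passing these parameters directly on the `cpipes` command line; the generated `--help` output is the authoritative reference):

```bash
cpipes \
--pipeline bettercallsal \
--input /path/to/fastq_pass_dir \
--output /path/to/where/output/should/go \
--sfhpy_fcv 0.2 \
--sfhpy_gt true \
-profile your_institution
```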
