Make linter happy
itrujnara committed Dec 4, 2024
1 parent c10af50 commit 0a24860
Showing 3 changed files with 56 additions and 58 deletions.
64 changes: 32 additions & 32 deletions CHANGELOG.md
@@ -5,67 +5,67 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## v1.0.0dev

__The content below is the unaltered changelog of the unreleased 2020 version of the pipeline.__
**The content below is the unaltered changelog of the unreleased 2020 version of the pipeline.**

## v0.1.0dev - [date]

Initial release of nf-core/kmermaid, created with the [nf-core](https://nf-co.re/) template.

### `Added`

* Add option to use Dayhoff encoding for sourmash.
* Add `bam2fasta` process to kmermaid pipeline and flags involved.
* Add `extract_coding` and `peptide_bloom_filter` process and flags involved.
* Add `track_abundance` feature to keep track of hashed kmer frequency.
* Add social preview image
* Add `fastp` process for trimming reads
* Add option to use compressed `.tgz` file containing output from 10X Genomics' `cellranger count` outputs, including `possorted_genome_bam.bam` and `barcodes.tsv` files
* Add samtools_fastq_unaligned and samtools_fastq_aligned process for converting bam to per cell
barcode fastq
* Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
* For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
* Add `sourmash sig merge` for aligned/unaligned signatures from bam files, and add `--skip_sig_merge` option to turn it off
* Add `--protein_fastas` option for creating sketches of already-translated protein sequences
* Add `--skip_compare option` to skip `sourmash_compare_sketches` process
* Add merging of aligned/unaligned parts of single-cell data ([#117](https://github.com/nf-core/kmermaid/pull/117))
* Add renamed package dependency orpheum (used to be known as sencha)
- Add option to use Dayhoff encoding for sourmash.
- Add `bam2fasta` process to kmermaid pipeline and flags involved.
- Add `extract_coding` and `peptide_bloom_filter` process and flags involved.
- Add `track_abundance` feature to keep track of hashed kmer frequency.
- Add social preview image
- Add `fastp` process for trimming reads
- Add option to use compressed `.tgz` file containing output from 10X Genomics' `cellranger count` outputs, including `possorted_genome_bam.bam` and `barcodes.tsv` files
- Add samtools_fastq_unaligned and samtools_fastq_aligned process for converting bam to per cell
barcode fastq
- Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
- For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
- Add `sourmash sig merge` for aligned/unaligned signatures from bam files, and add `--skip_sig_merge` option to turn it off
- Add `--protein_fastas` option for creating sketches of already-translated protein sequences
- Add `--skip_compare option` to skip `sourmash_compare_sketches` process
- Add merging of aligned/unaligned parts of single-cell data ([#117](https://github.com/nf-core/kmermaid/pull/117))
- Add renamed package dependency orpheum (used to be known as sencha)

### `Fixed`

#### Resources

* Increase CPUs in `high_memory_long` profile from 1 to 10
- Increase CPUs in `high_memory_long` profile from 1 to 10

#### Naming

* Rename splitkmer to `split_kmer`
- Rename splitkmer to `split_kmer`

#### Per-cell fastqs and bams

* Remove `one_signature_per_record` flag and add bam2fasta count_umis_percell and make_fastqs_percell instead of bam2fasta sharding method
* Use ripgrep instead of bam2fasta to make per-cell fastq, which will hopefully make resuming long-running pipelines on bams much faster
* Make sure `samtools_fastq_aligned` outputs ALL aligned reads, regardless of mapping quality or primary alignment status
- Remove `one_signature_per_record` flag and add bam2fasta count_umis_percell and make_fastqs_percell instead of bam2fasta sharding method
- Use ripgrep instead of bam2fasta to make per-cell fastq, which will hopefully make resuming long-running pipelines on bams much faster
- Make sure `samtools_fastq_aligned` outputs ALL aligned reads, regardless of mapping quality or primary alignment status

#### Sourmash

* add `--skip_compute option` to skip `sourmash_compute_sketch_*`
* Used `.combine()` instead of `each` to do cartesian product of all possible molecules, ksizes, and sketch values
* Do `sourmash compute` on all input ksizes, and all peptide molecule types, at once to save disk reading/writing efforts
- add `--skip_compute option` to skip `sourmash_compute_sketch_*`
- Used `.combine()` instead of `each` to do cartesian product of all possible molecules, ksizes, and sketch values
- Do `sourmash compute` on all input ksizes, and all peptide molecule types, at once to save disk reading/writing efforts

#### Translate

* Updated sencha=1.0.3 to fix the bug in memory errors possibly with the numpy array on unique filenames ([PR #96 on orpheum](https://github.com/czbiohub/orpheum/pull/96))
* Add option to write non-coding nucleotide sequences fasta files while doing sencha translate
* Don't save translate csvs and jsons by default, add separate `--save_translate_json` and `--save_translate_csv`
* Updated `sencha translate` default parameters to be `--ksize 8 --jaccard-threshold 0.05` because those were the most successful
* Update renaming of `khtools` commands to `sencha`
- Updated sencha=1.0.3 to fix the bug in memory errors possibly with the numpy array on unique filenames ([PR #96 on orpheum](https://github.com/czbiohub/orpheum/pull/96))
- Add option to write non-coding nucleotide sequences fasta files while doing sencha translate
- Don't save translate csvs and jsons by default, add separate `--save_translate_json` and `--save_translate_csv`
- Updated `sencha translate` default parameters to be `--ksize 8 --jaccard-threshold 0.05` because those were the most successful
- Update renaming of `khtools` commands to `sencha`

#### MultiQC

* Fix the use of `skip_multiqc` flag condition with if and not when
- Fix the use of `skip_multiqc` flag condition with if and not when

### `Dependencies`

### `Deprecated`

* Removed ability to specify multiple `--scaled` or `--num-hashes` values to enable merging of signatures
- Removed ability to specify multiple `--scaled` or `--num-hashes` values to enable merging of signatures
10 changes: 4 additions & 6 deletions README.md
@@ -19,14 +19,14 @@

## Introduction

**nf-core/kmermaid** is a bioinformatics pipeline that performs comparative analysis of *omes using k-mer based methods. It supports various reference and sequencing input formats, and provides statistics files along with a MultiQC report as output. It provides pre-processing methods for reads and alignments.
**nf-core/kmermaid** is a bioinformatics pipeline that performs comparative analysis of \*omes using k-mer based methods. It supports various reference and sequencing input formats, and provides statistics files along with a MultiQC report as output. It provides pre-processing methods for reads and alignments.

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->

In the outline below, every step except for the main analysis is optional and might be input-dependent.

__Optional – BAM preprocessing__
**Optional – BAM preprocessing**

1. Extract BAM from 10X archive (`tar`)
2. Extract FASTQ reads ([`samtools`](http://www.htslib.org/))
@@ -35,22 +35,20 @@ __Optional – BAM preprocessing__

5. Download SRA experiment () [optional]

__Optional – read preprocessing__
**Optional – read preprocessing**

6. Trim reads ([`fastp`](https://github.com/OpenGene/fastp))
7. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
8. Remove rRNA ([`sortmerna`](https://github.com/sortmerna/sortmerna))
9. Translate to protein ([`orpheum`](https://github.com/czbiohub-sf/orpheum))

__k-mer analysis per method__
**k-mer analysis per method**

10. Create sketch
11. Calculate distances

12. Present the results ([`MultiQC`](http://multiqc.info/))



## Usage

### With a samples.csv file
40 changes: 20 additions & 20 deletions docs/output.md
@@ -15,11 +15,11 @@ The directories listed below will be created in the results directory after the
The pipeline is built using [Nextflow](https://www.nextflow.io/)
and processes data using the following steps:

* [FastQC](#fastqc) - read quality control
* [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
* [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
* [Sourmash Sketch](#sourmash-sketch) - Compute a k-mer sketch of each sample
* [Sourmash Compare](#sourmash-compare) - Compare all samples on k-mer sketches
- [FastQC](#fastqc) - read quality control
- [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [Sourmash Sketch](#sourmash-sketch) - Compute a k-mer sketch of each sample
- [Sourmash Compare](#sourmash-compare) - Compare all samples on k-mer sketches

## FastQC

@@ -31,10 +31,10 @@ For further reading and documentation see the [FastQC help pages](http://www.bio
**Output files:**

* `fastqc/`
* `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files.
* `fastqc/zips/`
* `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files.
- `fastqc/zips/`
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.

> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
@@ -46,7 +46,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio

For each sample and provided `molecules`, `ksizes` and `sketch_num_hashes_log2`, a file is created:

* `sample_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.sig`
- `sample_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.sig`

For example:

@@ -71,8 +71,8 @@ SRR4050379_molecule-protein_ksize-9_sketch_num_hashes_log2-4.sig

For each provided `molecules`, `ksizes` and `sketch_num_hashes_log2`, a file is created containing a symmetric matrix of the similarity between all samples, written as a comma-separated variable file:

* `similarities_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.csv`
For example,
- `similarities_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.csv`
For example,

```bash
similarities_molecule-dna_ksize-9_sketch_num_hashes_log2-4.csv
@@ -92,18 +92,18 @@ For more information about how to use MultiQC reports, see [https://multiqc.info

**Output files:**

* `multiqc/`
* `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
* `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
* `multiqc_plots/`: directory containing static images from the report in various formats.
- `multiqc/`
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- `multiqc_plots/`: directory containing static images from the report in various formats.

## Pipeline information

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

**Output files:**

* `pipeline_info/`
* Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
* Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
* Documentation for interpretation of results in HTML format: `results_description.html`.
- `pipeline_info/`
- Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
- Documentation for interpretation of results in HTML format: `results_description.html`.
