From 0a2486072c98026bb124ac0a11ecf83dddcb414e Mon Sep 17 00:00:00 2001
From: Igor Trujnara
Date: Wed, 4 Dec 2024 14:09:54 +0100
Subject: [PATCH] Make linter happy

---
 CHANGELOG.md   | 64 +++++++++++++++++++++++++-------------------------
 README.md      | 10 ++++----
 docs/output.md | 40 +++++++++++++++----------------
 3 files changed, 56 insertions(+), 58 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 58945f55..feb530d8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,7 +5,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## v1.0.0dev

-__The content below is the unaltered changelog of the unreleased 2020 version of the pipeline.__
+**The content below is the unaltered changelog of the unreleased 2020 version of the pipeline.**

 ## v0.1.0dev - [date]

@@ -13,59 +13,59 @@ Initial release of nf-core/kmermaid, created with the [nf-core](https://nf-co.re

 ### `Added`

-* Add option to use Dayhoff encoding for sourmash.
-* Add `bam2fasta` process to kmermaid pipeline and flags involved.
-* Add `extract_coding` and `peptide_bloom_filter` process and flags involved.
-* Add `track_abundance` feature to keep track of hashed kmer frequency.
-* Add social preview image
-* Add `fastp` process for trimming reads
-* Add option to use compressed `.tgz` file containing output from 10X Genomics' `cellranger count` outputs, including `possorted_genome_bam.bam` and `barcodes.tsv` files
-* Add samtools_fastq_unaligned and samtools_fastq_aligned process for converting bam to per cell
-barcode fastq
-* Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
-* For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
-* Add `sourmash sig merge` for aligned/unaligned signatures from bam files, and add `--skip_sig_merge` option to turn it off
-* Add `--protein_fastas` option for creating sketches of already-translated protein sequences
-* Add `--skip_compare option` to skip `sourmash_compare_sketches` process
-* Add merging of aligned/unaligned parts of single-cell data ([#117](https://github.com/nf-core/kmermaid/pull/117))
-* Add renamed package dependency orpheum (used to be known as sencha)
+- Add option to use Dayhoff encoding for sourmash.
+- Add `bam2fasta` process to kmermaid pipeline and flags involved.
+- Add `extract_coding` and `peptide_bloom_filter` process and flags involved.
+- Add `track_abundance` feature to keep track of hashed kmer frequency.
+- Add social preview image
+- Add `fastp` process for trimming reads
+- Add option to use compressed `.tgz` file containing output from 10X Genomics' `cellranger count` outputs, including `possorted_genome_bam.bam` and `barcodes.tsv` files
+- Add samtools_fastq_unaligned and samtools_fastq_aligned process for converting bam to per cell
+  barcode fastq
+- Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
+- For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
+- Add `sourmash sig merge` for aligned/unaligned signatures from bam files, and add `--skip_sig_merge` option to turn it off
+- Add `--protein_fastas` option for creating sketches of already-translated protein sequences
+- Add `--skip_compare option` to skip `sourmash_compare_sketches` process
+- Add merging of aligned/unaligned parts of single-cell data ([#117](https://github.com/nf-core/kmermaid/pull/117))
+- Add renamed package dependency orpheum (used to be known as sencha)

 ### `Fixed`

 #### Resources

-* Increase CPUs in `high_memory_long` profile from 1 to 10
+- Increase CPUs in `high_memory_long` profile from 1 to 10

 #### Naming

-* Rename splitkmer to `split_kmer`
+- Rename splitkmer to `split_kmer`

 #### Per-cell fastqs and bams

-* Remove `one_signature_per_record` flag and add bam2fasta count_umis_percell and make_fastqs_percell instead of bam2fasta sharding method
-* Use ripgrep instead of bam2fasta to make per-cell fastq, which will hopefully make resuming long-running pipelines on bams much faster
-* Make sure `samtools_fastq_aligned` outputs ALL aligned reads, regardless of mapping quality or primary alignment status
+- Remove `one_signature_per_record` flag and add bam2fasta count_umis_percell and make_fastqs_percell instead of bam2fasta sharding method
+- Use ripgrep instead of bam2fasta to make per-cell fastq, which will hopefully make resuming long-running pipelines on bams much faster
+- Make sure `samtools_fastq_aligned` outputs ALL aligned reads, regardless of mapping quality or primary alignment status

 #### Sourmash

-* add `--skip_compute option` to skip `sourmash_compute_sketch_*`
-* Used `.combine()` instead of `each` to do cartesian product of all possible molecules, ksizes, and sketch values
-* Do `sourmash compute` on all input ksizes, and all peptide molecule types, at once to save disk reading/writing efforts
+- add `--skip_compute option` to skip `sourmash_compute_sketch_*`
+- Used `.combine()` instead of `each` to do cartesian product of all possible molecules, ksizes, and sketch values
+- Do `sourmash compute` on all input ksizes, and all peptide molecule types, at once to save disk reading/writing efforts

 #### Translate

-* Updated sencha=1.0.3 to fix the bug in memory errors possibly with the numpy array on unique filenames ([PR #96 on orpheum](https://github.com/czbiohub/orpheum/pull/96))
-* Add option to write non-coding nucleotide sequences fasta files while doing sencha translate
-* Don't save translate csvs and jsons by default, add separate `--save_translate_json` and `--save_translate_csv`
-* Updated `sencha translate` default parameters to be `--ksize 8 --jaccard-threshold 0.05` because those were the most successful
-* Update renaming of `khtools` commands to `sencha`
+- Updated sencha=1.0.3 to fix the bug in memory errors possibly with the numpy array on unique filenames ([PR #96 on orpheum](https://github.com/czbiohub/orpheum/pull/96))
+- Add option to write non-coding nucleotide sequences fasta files while doing sencha translate
+- Don't save translate csvs and jsons by default, add separate `--save_translate_json` and `--save_translate_csv`
+- Updated `sencha translate` default parameters to be `--ksize 8 --jaccard-threshold 0.05` because those were the most successful
+- Update renaming of `khtools` commands to `sencha`

 #### MultiQC

-* Fix the use of `skip_multiqc` flag condition with if and not when
+- Fix the use of `skip_multiqc` flag condition with if and not when

 ### `Dependencies`

 ### `Deprecated`

-* Removed ability to specify multiple `--scaled` or `--num-hashes` values to enable merging of signatures
+- Removed ability to specify multiple `--scaled` or `--num-hashes` values to enable merging of signatures
diff --git a/README.md b/README.md
index 5eef247f..d535dab8 100644
--- a/README.md
+++ b/README.md
@@ -19,14 +19,14 @@
 ## Introduction

-**nf-core/kmermaid** is a bioinformatics pipeline that performs comparative analysis of *omes using k-mer based methods. It supports various reference and sequencing input formats, and provides statistics files along with a MultiQC report as output. It provides pre-processing methods for reads and alignments.
+**nf-core/kmermaid** is a bioinformatics pipeline that performs comparative analysis of \*omes using k-mer based methods. It supports various reference and sequencing input formats, and provides statistics files along with a MultiQC report as output. It provides pre-processing methods for reads and alignments.

 In the outline below, every step except for the main analysis is optional and might be input-dependent.

-__Optional – BAM preprocessing__
+**Optional – BAM preprocessing**

 1. Extract BAM from 10X archive (`tar`)
 2. Extract FASTQ reads ([`samtools`](http://www.htslib.org/))
@@ -35,22 +35,20 @@ __Optional – BAM preprocessing__
 5. Download SRA experiment () [optional]

-__Optional – read preprocessing__
+**Optional – read preprocessing**

 6. Trim reads ([`fastp`](https://github.com/OpenGene/fastp))
 7. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
 8. Remove rRNA ([`sortmerna`](https://github.com/sortmerna/sortmerna))
 9. Translate to protein ([`orpheum`](https://github.com/czbiohub-sf/orpheum))

-__k-mer analysis per method__
+**k-mer analysis per method**

 10. Create sketch
 11. Calculate distances
 12. Present the results ([`MultiQC`](http://multiqc.info/))

-
-
 ## Usage

 ### With a samples.csv file
diff --git a/docs/output.md b/docs/output.md
index b481b0b5..2b5863d5 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -15,11 +15,11 @@ The directories listed below will be created in the results directory after the

 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

-* [FastQC](#fastqc) - read quality control
-* [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
-* [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
-* [Sourmash Sketch](#sourmash-sketch) - Compute a k-mer sketch of each sample
-* [Sourmash Compare](#sourmash-compare) - Compare all samples on k-mer sketches
+- [FastQC](#fastqc) - read quality control
+- [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
+- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+- [Sourmash Sketch](#sourmash-sketch) - Compute a k-mer sketch of each sample
+- [Sourmash Compare](#sourmash-compare) - Compare all samples on k-mer sketches

 ## FastQC

@@ -31,10 +31,10 @@ For further reading and documentation see the [FastQC help pages](http://www.bio

 **Output files:**

-* `fastqc/`
-  * `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files.
-* `fastqc/zips/`
-  * `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+- `fastqc/`
+  - `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files.
+- `fastqc/zips/`
+  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.

 > **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.

@@ -46,7 +46,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio

 For each sample and provided `molecules`, `ksizes` and `sketch_num_hashes_log2`, a file is created:

-* `sample_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.sig`
+- `sample_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.sig`

 For example:

@@ -71,8 +71,8 @@ SRR4050379_molecule-protein_ksize-9_sketch_num_hashes_log2-4.sig

 For each provided `molecules`, `ksizes` and `sketch_num_hashes_log2`, a file is created containing a symmetric matrix of the similarity between all samples, written as a comma-separated variable file:

-* `similarities_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.csv`
-For example,
+- `similarities_molecule-${molecule}__ksize-${ksize}__${sketch_value}__track_abundance-${track_abundance}.csv`
+  For example,

 ```bash
 similarities_molecule-dna_ksize-9_sketch_num_hashes_log2-4.csv
@@ -92,10 +92,10 @@ For more information about how to use MultiQC reports, see [https://multiqc.info

 **Output files:**

-* `multiqc/`
-  * `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
-  * `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
-  * `multiqc_plots/`: directory containing static images from the report in various formats.
+- `multiqc/`
+  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
+  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
+  - `multiqc_plots/`: directory containing static images from the report in various formats.

 ## Pipeline information

@@ -103,7 +103,7 @@ For more information about how to use MultiQC reports, see [https://multiqc.info

 **Output files:**

-* `pipeline_info/`
-  * Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
-  * Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
-  * Documentation for interpretation of results in HTML format: `results_description.html`.
+- `pipeline_info/`
+  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
+  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`.
+  - Documentation for interpretation of results in HTML format: `results_description.html`.