Commit

v0.8.0
shenwei356 committed Feb 24, 2022
1 parent 0c3a9ec commit 9896fed
Showing 38 changed files with 1,049 additions and 518 deletions.
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -1,12 +1,39 @@
# Changelog

### v0.8.0 - 2022-02-24

- commands:
- new command `utils cov2simi`: Convert k-mer coverage to sequence similarity.
- new command `utils query-fpr`: Compute the maximal false positive rate of a query.
- `compute`:
- update doc.
- add flags compatibility check.
- `search`:
- **output the false positive rate of each match, rather than the FPR upper bound of the query**.
This could rescue some short queries that have high similarity.
- **change default values of reads filter, because clinical data contain many short reads**.
- `-c/--min-uniq-reads`: `30` -> `10`.
- `-m/--min-query-len`: `70` -> `30`.
- update doc.
- `profile`:
- rename flags:
- `--keep-main-match` -> `--keep-main-matches`.
- `--keep-perfect-match` -> `--keep-perfect-matches`.
- change default values:
- `--max-qcov-gap`: `0.2` -> `0.4`.
- mode 0 (pathogen detection):
- switch on flag `--keep-main-matches`
- use `--max-qcov-gap 0.4`
- update doc.
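
The difference between a per-query FPR upper bound and a per-match FPR can be illustrated with a small sketch. It assumes the Chernoff-style bound commonly used for Bloom-filter-based sequence indexes, `exp(-n(t-p)^2 / (2(1-p)))`, where `n` is the number of k-mers in the query, `p` the false positive rate of a single k-mer, and `t` the fraction of matched k-mers. Whether KMCP computes the value in exactly this way is an assumption; the function name and the example read are hypothetical.

```python
import math

def chernoff_fpr_bound(n_kmers: int, kmer_fpr: float, coverage: float) -> float:
    """Upper bound on the chance that >= coverage * n_kmers k-mers match by accident."""
    if coverage <= kmer_fpr:
        return 1.0  # the bound is uninformative at or below the single k-mer FPR
    return math.exp(-n_kmers * (coverage - kmer_fpr) ** 2 / (2 * (1 - kmer_fpr)))

# Hypothetical short read: 30 bp with k = 21 gives 30 - 21 + 1 = 10 k-mers.
n = 30 - 21 + 1

# Bound evaluated at a query coverage threshold of 0.55 (per-query upper bound).
query_bound = chernoff_fpr_bound(n, kmer_fpr=0.3, coverage=0.55)

# Bound evaluated at the coverage of one perfect match (per-match value).
match_bound = chernoff_fpr_bound(n, kmer_fpr=0.3, coverage=1.0)

print(f"{query_bound:.3f} {match_bound:.3f}")  # prints: 0.640 0.030
```

Under these assumptions, a short read would be rejected on its per-query bound, while a match with high coverage still gets a small per-match value.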

### v0.7.1 - 2022-02-08

- `profile`:
- new flag `--metaphlan-report-version` and the default value is `3`. [#4](https://github.com/shenwei356/kmcp/issues/4)
- column name renamed: from `fragsFrac`, `fragsRelDepth`, `fragsRelDepthStd` to `chunksFrac`, `chunksRelDepth`, `chunksRelDepthStd`.
- fix computation of `chunksRelDepth`.
- slightly improve sensitivity for `-m 0`.

### v0.7.0 - 2022-01-24

- commands:
14 changes: 8 additions & 6 deletions README.md
@@ -7,7 +7,7 @@

### 1. Accurate metagenomic profiling

KMCP adopts a novol metagenomic profiling strategy,
KMCP adopts a novel metagenomic profiling strategy,
by splitting reference genomes into 10 chunks and mapping reads to these
chunks via fast k-mer matching.
KMCP performs well on both prokaryotic and viral organisms, with higher
@@ -16,7 +16,7 @@ sensitivity and specificity than other k-mer-based tools

### 2. Fast sequence search against large scales of genomic datasets

KMCP can be used for fast sequence search against large scales of genomic dataset
KMCP can be used for fast sequence search against large scales of genomic datasets
as [BIGSI](https://github.com/Phelimb/BIGSI) and [COBS](https://github.com/bingmann/cobs) do.
We reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm,
bringing a smaller index size and much faster searching speed ([4x-10x faster than COBS](https://bioinf.shenwei.me/kmcp/benchmark/searching/#result))
@@ -33,7 +33,7 @@ provide fast genome distance estimation using MinHash (Mash) or FracMinHash (Sca
KMCP utilizes multiple k-mer sketches
([Minimizer](https://academic.oup.com/bioinformatics/article/20/18/3363/202143),
[FracMinHash](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2)
(previously named [Scaled MinHash](https://f1000research.com/articles/8-1006)) and
(previously named [Scaled MinHash](https://f1000research.com/articles/8-1006)), and
[Closed Syncmers](https://peerj.com/articles/10805/)) for genome similarity estimation.
[KMCP is 4x-7x faster than Mash/Sourmash](https://bioinf.shenwei.me/kmcp/benchmark/searching/#result)
(check the [tutorial](https://bioinf.shenwei.me/kmcp/tutorial/searching) and [benchmark](https://bioinf.shenwei.me/kmcp/benchmark/searching)).
@@ -59,7 +59,7 @@ KMCP utilizes multiple k-mer sketches
- [**HPC cluster could linearly accelerate searching**](https://bioinf.shenwei.me/kmcp/benchmark/profiling/#analysis-time-and-storage-requirement) with each computation node hosting a database built with a part of reference genomes.
- Computers with limited main memory would also support searching by building small databases.
- **Accurate taxonomic profiling**
- Some k-mer based taxonomic profilers suffers from high false positive rates,
- Some k-mer based taxonomic profilers suffer from high false positive rates,
while [KMCP adopts multiple strategies](https://bioinf.shenwei.me/kmcp/tutorial/profiling/#methods)
to [improve specificity and keep high sensitivity at the same time](https://bioinf.shenwei.me/kmcp/benchmark/profiling).
- In addition to archaea and bacteria, [KMCP performs well on **viruses/phages**](https://bioinf.shenwei.me/kmcp/benchmark/profiling/#16-mock-virome-communities-from-roux-et-al-virusesphages).
@@ -109,15 +109,17 @@ in two packages for better searching performance.

|subcommand |function |
|:-------------------------------------------------------------------------|:---------------------------------------------------------------|
|[compute](https://bioinf.shenwei.me/kmcp/usage/#compute) |Generate k-mers (sketches) from FASTA/Q sequences |
|[compute](https://bioinf.shenwei.me/kmcp/usage/#compute) |Generate k-mers (sketch) from FASTA/Q sequences |
|[index](https://bioinf.shenwei.me/kmcp/usage/#index) |Construct database from k-mer files |
|[search](https://bioinf.shenwei.me/kmcp/usage/#search) |Search sequences against a database |
|[search](https://bioinf.shenwei.me/kmcp/usage/#search) |Search sequences against a database |
|[merge](https://bioinf.shenwei.me/kmcp/usage/#merge) |Merge search results from multiple databases |
|[profile](https://bioinf.shenwei.me/kmcp/usage/#profile) |Generate taxonomic profile from search results |
|[utils filter](https://bioinf.shenwei.me/kmcp/usage/#filter) |Filter search results and find species/assembly-specific queries|
|[utils merge-regions](https://bioinf.shenwei.me/kmcp/usage/#merge-regions)|Merge species/assembly-specific regions |
|[utils unik-info](https://bioinf.shenwei.me/kmcp/usage/#unik-info) |Print information of .unik file |
|[utils index-info](https://bioinf.shenwei.me/kmcp/usage/#index-info) |Print information of index file |
|[utils cov2simi](https://bioinf.shenwei.me/kmcp/usage/#cov2simi) |Convert k-mer coverage to sequence similarity |
|[utils query-fpr](https://bioinf.shenwei.me/kmcp/usage/#query-fpr) |Compute the maximal false positive rate of a query |
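
As a rough illustration of how k-mer coverage relates to sequence similarity, the sketch below uses a textbook model: if substitutions are independent and there are no indels, a k-mer survives with probability `similarity**k`, so `similarity ≈ coverage**(1/k)`. This is an assumption for illustration only and is not necessarily the method implemented by `utils cov2simi`; the function name is hypothetical, and `k = 21` is taken from the database names used in the benchmarks.

```python
def coverage_to_similarity(coverage: float, k: int = 21) -> float:
    """Invert coverage ~= similarity**k (independent substitutions, no indels)."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    return coverage ** (1.0 / k)

# Coverage values 0.4 and 0.55 are the thresholds mentioned in the benchmarks.
for cov in (0.4, 0.55, 1.0):
    print(f"coverage {cov:.2f} -> similarity ~{coverage_to_similarity(cov):.4f}")
```

Under this model, a k-mer coverage of 0.55 corresponds to roughly 97% similarity and 0.4 to roughly 96%.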

## Quickstart

6 changes: 3 additions & 3 deletions benchmarks/cami2-mouse-gut/README.md
@@ -7,14 +7,14 @@ Benchmark results of other tools are downloaded from: https://zenodo.org/search?

## Softwares

- kmcp [v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1)
- kmcp [v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0)
- motus 2.5.1
- metaphlan 2.9.21
- bracken 2.5

## Databases

[Prebuilt databases](https://1drv.ms/u/s!Ag89cZ8NYcqtjVVADr8r--fnKFt-?e=ivNZNK):
[Prebuilt databases and the reference genomes](https://1drv.ms/u/s!Ag89cZ8NYcqtjVVADr8r--fnKFt-?e=ivNZNK):

- DB for bacteria: [refseq-cami2-k21-n10.db.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtjV62KmQmOojxwBRr?e=lp5a9F), [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtjWISqJGcxQD39FCv?e=CQ0E8d)
- DB for viruses: [refseq-cami2-viral-k21-n5.db.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtjVyYFIHY01PtDMcx?e=AO7xkY), [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtjWDTIXL4eMpZNVA0?e=1YXKkk)
@@ -29,7 +29,7 @@ Reference genomes (Bacteria and Archaea):

1. Microbial genomes were extracted from CAMI2 RefSeq snapshot (`2019-01-08`) using
the corresponding taxonomy information.
2. For every species, at most 5 assemblies (sorted by assembly accession) were kept.
2. **For every species, at most 5 assemblies (sorted by assembly accession) were kept**.
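
The per-species cap in step 2 can be sketched as follows. This is a minimal illustration, assuming the input is a list of `(species, assembly accession)` pairs and that "sorted by assembly accession" means plain lexicographic order; the function name, species labels, and accessions are made up.

```python
from collections import defaultdict

def keep_top_assemblies(records, max_per_species=5):
    """Keep at most `max_per_species` assemblies per species, lowest accessions first.

    records: iterable of (species, assembly_accession) pairs.
    """
    by_species = defaultdict(list)
    for species, accession in records:
        by_species[species].append(accession)
    # Sort accessions lexicographically and keep the first few per species.
    return {sp: sorted(accs)[:max_per_species] for sp, accs in by_species.items()}

# Hypothetical input: seven assemblies for one species, one for another.
records = [("sp1", f"GCF_00000{i}.1") for i in range(7, 0, -1)]
records.append(("sp2", "GCF_000010.1"))

kept = keep_top_assemblies(records)
print(kept["sp1"])  # the five lowest accessions of sp1
print(kept["sp2"])  # species with fewer than five assemblies keep all of them
```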

Reference genomes (Viruses):

21 changes: 8 additions & 13 deletions benchmarks/mock-virome-roux2016/README.md
@@ -2,11 +2,11 @@

## Softwares

- KMCP [v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1)
- MetaPhlAn [3.0.13 (27 Jul, 2021)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)
- KMCP [v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0)
- MetaPhlAn [3.0.13 (2021-07-27)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2 (2021-05-10)](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2 (2021-03-22)](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4 (2021-08-17)](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)

## Databases and taxonomy version

@@ -29,7 +29,7 @@ to Linker Amplification for quantitative amplification of both dsDNA and ssDNA t
> situations with either low abundance of ssDNA viruses (MCA, total ssDNA ∼2% of
> community) or high abundance of ssDNA viruses (MCB, total ssDNA ∼66% of community)
> Roux S, Solonenko NE, Dang VT, Poulos BT, Schwenck SM, Goldsmith DB, Coleman ML, Breitbart M, Sullivan MB. 2016. Towards quantitative viromics for both double-stranded and single-stranded DNA viruses. PeerJ 4:e2777 https://doi.org/10.7717/peerj.2777
> Roux S, Solonenko NE, Dang VT, Poulos BT, Schwenck SM, Goldsmith DB, Coleman ML, Breitbart M, Sullivan MB. 2016. Towards quantitative viromics for both double-stranded and single-stranded DNA viruses. PeerJ 4:e2777 https://doi.org/10.7717/peerj.2777
Phages (rank is based on NCBI taxonomy 2021-12-06)

@@ -83,13 +83,11 @@ After carefully checking, we renamed samples as below:
MCA-G2 MCA-G3
MCA-N2 MCA-N3

We manually download the paired and unpaired reads for every sample, for example:
We manually download the paired-end reads for every sample, for example:

$ ls reads/ | head -n 4
$ ls reads/ | head -n 2
MCA-G2_GGACTCCT-GCGTAAGA_L002_R1_001_t_paired.fastq.gz
MCA-G2_GGACTCCT-GCGTAAGA_L002_R1_001_t_unpaired.fastq.gz
MCA-G2_GGACTCCT-GCGTAAGA_L002_R2_001_t_paired.fastq.gz
MCA-G2_GGACTCCT-GCGTAAGA_L002_R2_001_t_unpaired.fastq.gz

**Note that some files of MCA-S1 and MCA-S2 are corrupted.**

@@ -127,7 +125,6 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
| csvtk sort -H -k 1:N \
| rush -v db=$db -v dbname=$dbname -j $j -v j=$J -v 'p={@^(.+)_R1_}' \
'kmcp search -d {db} {p}_R1_001_t_paired.fastq.gz {p}_R2_001_t_paired.fastq.gz \
{p}_R1_001_t_unpaired.fastq.gz {p}_R2_001_t_unpaired.fastq.gz \
-o {p}.kmcp@{dbname}.tsv.gz \
--log {p}.kmcp@{dbname}.tsv.gz.log -j {j}' \
-c -C $reads@$dbname.rush
@@ -145,7 +142,6 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
| csvtk sort -H -k 1:N \
| rush -v db=$db -v dbname=$dbname -j $j -v j=$J -v 'p={@^(.+)_R1_}' \
'kmcp search -d {db} {p}_R1_001_t_paired.fastq.gz {p}_R2_001_t_paired.fastq.gz \
{p}_R1_001_t_unpaired.fastq.gz {p}_R2_001_t_unpaired.fastq.gz \
-o {p}.kmcp@{dbname}.tsv.gz \
--log {p}.kmcp@{dbname}.tsv.gz.log -j {j}' \
-c -C $reads@$dbname.rush
@@ -163,7 +159,6 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
| csvtk sort -H -k 1:N \
| rush -v db=$db -v dbname=$dbname -j $j -v j=$J -v 'p={@^(.+)_R1_}' \
'kmcp search -d {db} {p}_R1_001_t_paired.fastq.gz {p}_R2_001_t_paired.fastq.gz \
{p}_R1_001_t_unpaired.fastq.gz {p}_R2_001_t_unpaired.fastq.gz \
-o {p}.kmcp@{dbname}.tsv.gz \
--log {p}.kmcp@{dbname}.tsv.gz.log -j {j}' \
-c -C $reads@$dbname.rush
14 changes: 8 additions & 6 deletions benchmarks/profiling/README.md
@@ -2,12 +2,12 @@

## Softwares

- KMCP [v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1)
- mOTUs [3.0.1 (Jul 28, 2021)](https://github.com/motu-tool/mOTUs/releases/tag/3.0.1)
- MetaPhlAn [3.0.13 (27 Jul, 2021)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)
- KMCP [v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0)
- mOTUs [3.0.1 (2021-07-28)](https://github.com/motu-tool/mOTUs/releases/tag/3.0.1)
- MetaPhlAn [3.0.13 (2021-07-27)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2 (2021-05-10)](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2 (2021-03-22)](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4 (2021-08-17)](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)

## Databases and taxonomy version

@@ -16,6 +16,7 @@
- mOTUs, 3.0.1 (2021-06-28), 2019-01
- MetaPhlAn, mpa_v30_CHOCOPhlAn_201901 (?), 2019-01
- Kraken, PlusPF (2021-05-17), 2021-05-17
- Kraken, built with the same genomes as KMCP.

## Datasets

@@ -529,3 +530,4 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
>{p}.log 2>&1 '

stats centrifuge centrifuge-pe > centrifuge.stats

18 changes: 11 additions & 7 deletions benchmarks/real-pathogen-gu2020/README.md
@@ -2,10 +2,10 @@

## Softwares

- KMCP [v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1)
- Kraken [v2.1.2](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)
- KMCP [v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0)
- Kraken [v2.1.2 (2021-05-10)](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2 (2021-03-22)](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4 (2021-08-17)](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)

## Databases and taxonomy version

@@ -14,7 +14,11 @@
- Centrifuge, built with the same genomes as KMCP.
- Kraken, built with the same genomes as KMCP.

**We create the databases of GTDB and Refseq-fungi with a smaller false-positive rate `0.1` instead of `0.3`, and use a smaller query coverage threshold `0.4` instead of `0.55`.**
**We create the databases of GTDB and Refseq-fungi with a smaller false-positive rate `0.1` instead of `0.3`,
and use `2` hash functions instead of `1`.
The size of the GTDB database increases from 58 to 109 GB, and that of Refseq-fungi from 4.2 to 7.9 GB.
We use a smaller query coverage threshold `0.4` instead of `0.55` during searching and profiling,
and use the re-built mode 0 (pathogen detection) in profiling.**
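
The reported size increase is consistent with the standard Bloom filter sizing formula, as the following sketch shows. It assumes that the index size scales with the number of bits per element, `m/n = -k / ln(1 - p^(1/k))`, for a false positive rate `p` and `k` hash functions; the actual index layout may add rounding or overhead, so this is only a sanity check, and the function name is hypothetical.

```python
import math

def bits_per_element(fpr: float, n_hashes: int) -> float:
    """Bits per element for a Bloom filter: m/n = -k / ln(1 - p**(1/k))."""
    return -n_hashes / math.log(1.0 - fpr ** (1.0 / n_hashes))

old = bits_per_element(0.3, 1)  # previous setting: FPR 0.3, 1 hash function
new = bits_per_element(0.1, 2)  # this benchmark: FPR 0.1, 2 hash functions
ratio = new / old

print(f"size ratio: {ratio:.2f}")
print(f"GTDB: 58 GB -> ~{58 * ratio:.0f} GB")
print(f"Refseq-fungi: 4.2 GB -> ~{4.2 * ratio:.1f} GB")
```

The ratio is about 1.88, which predicts roughly 109 GB and 7.9 GB, in agreement with the sizes stated above.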

In this benchmark, we generate metagenomic profiles with the same NCBI Taxonomy version 2021-12-06,
including the gold-standard profiles.
Expand Down Expand Up @@ -110,7 +114,7 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
fd fastq.gz$ $reads/ \
| csvtk sort -H -k 1:N \
| rush -v db=$db -v dbname=$dbname -j $j -v j=$J -v 'p={:}' \
'kmcp search -d {db} {} \
'kmcp search -t 0.4 -d {db} {} \
-o {p}.kmcp@{dbname}.tsv.gz \
--log {p}.kmcp@{dbname}.tsv.gz.log -j {j}' \
-c -C $reads@$dbname.rush
@@ -127,7 +131,7 @@ We search against GTDB, Genbank-viral, and Refseq-fungi respectively, and merge
fd fastq.gz$ $reads/ \
| csvtk sort -H -k 1:N \
| rush -v db=$db -v dbname=$dbname -j $j -v j=$J -v 'p={:}' \
'kmcp search -d {db} {} \
'kmcp search -t 0.4 -d {db} {} \
-o {p}.kmcp@{dbname}.tsv.gz \
--log {p}.kmcp@{dbname}.tsv.gz.log -j {j}' \
-c -C $reads@$dbname.rush
2 changes: 1 addition & 1 deletion benchmarks/searching/README.md
@@ -9,7 +9,7 @@ Softwares
- [COBS](https://github.com/bingmann/cobs) ([1915fc0](https://github.com/bingmann/cobs/commit/1915fc061bbe47946116b4a051ed7b4e3f3eca15))
- [Sourmash](https://github.com/dib-lab/sourmash) (v4.2.2)
- [Mash](https://github.com/marbl/Mash) (v2.3)
- KMCP ([v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1))
- KMCP ([v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0))

Utilities

18 changes: 11 additions & 7 deletions benchmarks/sim-bact-sun2021/README.md
@@ -2,21 +2,25 @@

## Softwares

- KMCP [v0.7.1](https://github.com/shenwei356/kmcp/releases/tag/v0.7.1)
- mOTUs [3.0.1 (Jul 28, 2021)](https://github.com/motu-tool/mOTUs/releases/tag/3.0.1)
- MetaPhlAn [3.0.13 (27 Jul, 2021)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)
- KMCP [v0.8.0](https://github.com/shenwei356/kmcp/releases/tag/v0.8.0)
- mOTUs [3.0.1 (2021-07-28)](https://github.com/motu-tool/mOTUs/releases/tag/3.0.1)
- MetaPhlAn [3.0.13 (2021-07-27)](https://github.com/biobakery/MetaPhlAn/releases/tag/3.0.13)
- Kraken [v2.1.2 (2021-05-10)](https://github.com/DerrickWood/kraken2/releases/tag/v2.1.2),
Bracken [v2.6.2 (2021-03-22)](https://github.com/jenniferlu717/Bracken/releases/tag/v2.6.2)
- Centrifuge [v1.0.4 (2021-08-17)](https://github.com/DaehwanKimLab/centrifuge/releases/tag/v1.0.4)
- DUDes [v0.08 (2017-11-08)](https://github.com/pirovc/dudes/releases/tag/dudes_v0.08)
- SLIMM [v0.3.4 (2018-09-04)](https://github.com/seqan/slimm/releases/tag/v0.3.4)

## Databases and taxonomy version

- KMCP, GTDB-RS202 (2021-04-27) + Genbank-viral (r246, 2021-12-06) + Refseq-fungi (r208, 2021-09-30), 2021-12-06
- KMCP, GTDB-RS202 (2021-04-27) + Refseq-fungi (r208, 2021-09-30), 2021-12-06
- Centrifuge, built with the same genomes as KMCP.
- mOTUs, 3.0.1 (2021-06-28), 2019-01
- MetaPhlAn, mpa_v30_CHOCOPhlAn_201901 (?), 2019-01
- Kraken, PlusPF (2021-05-17), 2021-05-17
- Kraken, built with the same genomes as KMCP.
- DUDes, built with the same genomes as KMCP.
- SLIMM, built with the same genomes as KMCP.

In this benchmark, we generate metagenomic profiles with the same NCBI Taxonomy version 2021-12-06,
including the gold-standard profiles.
2 changes: 2 additions & 0 deletions commands.tsv
@@ -8,3 +8,5 @@ subcommand function
[utils merge-regions](https://bioinf.shenwei.me/kmcp/usage/#merge-regions) Merge species/assembly-specific regions
[utils unik-info](https://bioinf.shenwei.me/kmcp/usage/#unik-info) Print information of .unik file
[utils index-info](https://bioinf.shenwei.me/kmcp/usage/#index-info) Print information of index file
[utils cov2simi](https://bioinf.shenwei.me/kmcp/usage/#cov2simi) Convert k-mer coverage to sequence similarity
[utils query-fpr](https://bioinf.shenwei.me/kmcp/usage/#query-fpr) Compute the maximal false positive rate of a query