diff --git a/README.md b/README.md index 190eba6..b4aa1ef 100644 --- a/README.md +++ b/README.md @@ -85,6 +85,8 @@ The workflow for YACHT is as follows: 2. Preprocess the reference genomes by removing the "too similar" genomes based on `ANI` using the `ani_thresh` parameter 3. Run YACHT to detect the presence of reference genomes in your sample +
+ ### Creating sketches of your reference database genomes You will need a reference database in the form of [Sourmash](https://sourmash.readthedocs.io/en/latest/) sketches of a collection of microbial genomes. There are a variety of pre-created databases available at: https://sourmash.readthedocs.io/en/latest/databases.html. Our code uses the "Zipfile collection" format, and we suggest using the [GTDB genomic representatives database](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.zip): @@ -104,18 +106,20 @@ sourmash sketch dna -f -p k=31,scaled=1000,abund --singleton > dataset.csv sourmash sketch fromfile dataset.csv -p dna,k=31,scaled=1000,abund -o ../training_database.sig.zip # cd back to YACHT + +## Method 2 +# cd into the relevant directory +sourmash sketch dna -f -p k=31,scaled=1000,abund *.fasta -o ../training_database.sig.zip +# cd back to YACHT ``` +
+ ### Creating sketches of your sample You will then create a sketch of your sample metagenome, using the same k-mer size and scale factor @@ -155,6 +159,12 @@ In the two preceding steps, you will obtain a k-mer sketch file in zip format (i ### Preprocess the reference genomes (Training Step) +##### Warning: the training process is time-consuming on large databases + +In our benchmark with the `GTDB representative genomes`, training takes `15 minutes` using `16 threads, 50GB of MEM` on a system equipped with a `3.5GHz AMD EPYC 7763 64-Core Processor`. The processing time can be significantly longer when run on all GTDB genomes or with limited resources. If only a subset of the genomes is needed, one may use the `sourmash sig` command to extract only the signatures of interest. + +
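As a rough illustration of the subsetting tip in the warning above, the sketch below uses `sourmash sig extract` with a `--name` substring filter to write a smaller database before training. The file names and the name pattern are hypothetical placeholders, and it assumes the signature names in your database contain the species name. The helper only prints the command (a dry run) when `sourmash` or the input file is not available.

```bash
# Hypothetical inputs: adjust to your own database and organisms of interest.
REF_DB="gtdb-rs214-reps.k31.zip"      # full reference database (Zipfile collection)
NAME_PATTERN="Escherichia coli"       # substring matched against signature names
SUBSET_DB="subset_database.sig.zip"   # smaller database to pass to the training step

extract_subset() {
    # $1 = input database, $2 = name substring, $3 = output database
    if command -v sourmash >/dev/null 2>&1 && [ -f "$1" ]; then
        # Keep only the signatures whose name contains the given substring
        sourmash sig extract "$1" --name "$2" -o "$3"
    else
        # sourmash or the input file is missing: show what would be run
        echo "dry run: sourmash sig extract $1 --name \"$2\" -o $3"
    fi
}

extract_subset "$REF_DB" "$NAME_PATTERN" "$SUBSET_DB"
```

The resulting subset file could then be used in place of the full database in the training step, which should shorten the run roughly in proportion to the number of genomes removed.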
+ The script `make_training_data_from_sketches.py` extracts the sketches from the Zipfile-format reference database, and then turns them into a form usable by YACHT. In particular, it removes one of any two organisms that have ANI greater than the user-specified threshold as these two organisms are too close to be "distinguishable". ```bash