Merge branch 'master' of github.com:EBI-Metagenomics/genomes-pipeline

EBI-Metagenomics · Dec 6, 2022 · 90da54f
2 parents c2056d1 + 9c32779
Showing 1 changed file with 8 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -52,13 +52,15 @@ cd containers && bash build.sh
 
 ## Running the pipeline
 
-1. You need to pre-download your data to directoryes and make sure that all genomes are not compressed. If you have downloaded genomes from ENA and NCBI put them into different folders. If you've downloaded genomes from ENA save output CSV file with ENA genomes.
+1. You need to pre-download your data to directories and make sure that all genomes are not compressed. Scripts to fetch genomes from ENA ([fetch_ena.py](https://github.com/EBI-Metagenomics/genomes-pipeline/blob/master/containers/genomes-catalog-update/scripts/fetch_ena.py)) and NCBI ([fetch_ncbi.py](https://github.com/EBI-Metagenomics/genomes-pipeline/blob/master/containers/genomes-catalog-update/scripts/fetch_ncbi.py)) are provided and need to be executed separately from the pipeline. If you have downloaded genomes from both ENA and NCBI, put them into different folders.
 
-2. You will need the following information to create YML:
+2. When genomes are fetched from ENA using the `fetch_ena.py` script, a CSV file with contamination and completeness statistics is also created in the directory where the genomes are saved. If you are downloading genomes differently, the CSV file needs to be created manually (each line should contain the genome accession, % completeness, and % contamination). The ENA fetching script also pre-filters genomes to satisfy the QS50 cut-off (QS = % completeness - 5 * % contamination). If you obtain genomes from NCBI or another source, pre-filtering needs to be done before starting the pipeline unless lower-quality genomes are acceptable in the final catalogue. The pipeline will automatically remove genomes with completeness <50% and/or contamination >5%.
+
+3. You will need the following information to create YML:
  - catalogue name (for example, GUT)
- - catalogue version (for example, v1.0)
- - catalogue biom (for example, Human:Gut)
- - min amd max number of accessions (only MGnify specific). Max - Min = #total number of genomes (NCBI+ENA)
+ - catalogue version (for example, 1.0)
+ - catalogue biome (for example, root:Host-associated:Human:Digestive system:Large intestine:Fecal)
+ - min and max number of accessions (only MGnify specific). Max - Min = #total number of genomes (NCBI+ENA)
 
 ### Step by step execution
 
@@ -105,4 +107,4 @@ $ bash run.sh \
 ==== 9. Databases [/test_output/test_catalogue/scripts/step9.test_catalogue.sh] ====
 ==== 10. Re-structure [/test_output/test_catalogue/scripts/step10.test_catalogue.sh] ====
 ==== Final. Exit ====
-```
+```
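
The pre-filtering described in step 2 of the updated README (QS = % completeness - 5 * % contamination, plus the completeness <50% and contamination >5% limits) could be sketched roughly as below for genomes obtained from NCBI or another source. This is only an illustration and not the logic of `fetch_ena.py` or the pipeline itself: the function names, the header-less three-column CSV layout, the inclusive `>=` comparison at the cut-off, and the accessions are all assumptions.

```python
import csv
import io


def quality_score(completeness: float, contamination: float) -> float:
    """QS = % completeness - 5 * % contamination, as stated in the README."""
    return completeness - 5.0 * contamination


def filter_genomes(rows, qs_cutoff=50.0):
    """Keep rows of (accession, completeness, contamination) that pass the filters.

    Assumption: a genome scoring exactly at the cut-off is kept (>=).
    """
    kept = []
    for accession, completeness, contamination in rows:
        comp, cont = float(completeness), float(contamination)
        # Limits the README says the pipeline applies automatically.
        if comp < 50.0 or cont > 5.0:
            continue
        # QS50 pre-filter that the README asks users to apply themselves.
        if quality_score(comp, cont) >= qs_cutoff:
            kept.append((accession, comp, cont))
    return kept


# Hypothetical accessions and statistics, one genome per line:
# accession, % completeness, % contamination (no header row assumed).
example_csv = "GENOME_A,98.5,1.2\nGENOME_B,60.0,4.0\nGENOME_C,45.0,0.5\n"
rows = list(csv.reader(io.StringIO(example_csv)))
kept = filter_genomes(rows)
for accession, comp, cont in kept:
    print(accession, comp, cont)
```

In this made-up input, `GENOME_B` passes both individual limits but is dropped because its QS falls below 50, which is the case the README warns about for genomes that were not fetched with `fetch_ena.py`.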