- Aytan-Aktug [1],
- Seq2Geno2Pheno (Seq2Geno&Geno2Pheno) [2],
- PhenotypeSeeker v 0.7.3 [3],
- Kover 2.0 [4],
- ResFinder 4.0 [5], a direct association tool based on an AMR determinant database, was used as the baseline.
- Dataset overview
- Genome list of each single-species-antibiotic dataset in the form of `Data_<species>_<antibiotic>`
- Genome phenotype metadata of each species-antibiotic combination in the form of `Data_<species>_<antibiotic>_pheno.txt`
- Evaluation folds
- Tutorials for creating AMR benchmarking datasets
- Mapping from PATRIC ID to NCBI and GenBank ID
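
The naming convention for the dataset files can be sketched in Python. This is a minimal illustration, not code from the repository: `dataset_names` is a hypothetical helper, and writing spaces in species and antibiotic names as underscores is an assumption.

```python
from typing import Tuple


def dataset_names(species: str, antibiotic: str) -> Tuple[str, str]:
    """Return (genome list name, phenotype metadata name) for one dataset.

    Hypothetical helper: it only follows the Data_<species>_<antibiotic>
    pattern and does not touch the file system.
    """
    # Assumption: spaces in names are written as underscores.
    stem = "Data_{}_{}".format(
        species.replace(" ", "_"), antibiotic.replace(" ", "_")
    )
    return stem, stem + "_pheno.txt"


genome_list, pheno_file = dataset_names("Escherichia coli", "ampicillin")
print(genome_list)
print(pheno_file)
```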
- Dependencies
  - To reproduce the output, you need Linux and conda. We used Miniconda2 4.8.4. All software environments were activated from the default "base" environment.
  - Installation of the conda environments:
    ```
    git clone https://github.com/hzi-bifo/AMR_benchmarking.git
    cd AMR_benchmarking
    bash ./install/install.sh  # Creates 9 conda environments and installs the packages for each
    ```
  - For Kover, please refer to Kover's own documentation to try other installation methods.
  - Finally, you need to install PyTorch in `multi_torch_env` manually. To install a PyTorch build compatible with your CUDA version, follow the instructions at https://pytorch.org/get-started/locally/. Our code was tested with PyTorch v1.7.1 and CUDA versions 10.1 and 11.0.
- Memory requirement: Some procedures require very large amounts of memory. The feature-building procedure of the Aytan-Aktug multi-species model (adapted version) needs ~370 GB of memory. Other ML software needs up to 80 GB, depending on the number of CPUs and the specific species-antibiotic combination.
- Disk storage requirement: Some procedures generate extremely large intermediate files, although our pipeline deletes them once the procedure finishes. For example, PhenotypeSeeker (adapted version) needs the most disk storage, on the order of 10 TB depending on the species.
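
Before starting a long run, it may help to compare the machine against the memory and disk figures above. The following is a rough sketch using only the Python standard library; it is not part of the pipeline, the function names are hypothetical, the thresholds are approximations of the figures quoted above, and the memory query relies on `os.sysconf`, which is assumed to be available because the pipeline targets Linux.

```python
import os
import shutil

GIB = 1024 ** 3


def free_disk_gib(path):
    """Free space, in GiB, on the file system that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def total_memory_gib():
    """Total physical memory in GiB (Linux: page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB


def check_resources(path, need_disk_gib, need_mem_gib):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    disk = free_disk_gib(path)
    if disk < need_disk_gib:
        problems.append(
            "free disk {:.0f} GiB < required {:.0f} GiB".format(disk, need_disk_gib)
        )
    mem = total_memory_gib()
    if mem < need_mem_gib:
        problems.append(
            "memory {:.0f} GiB < required {:.0f} GiB".format(mem, need_mem_gib)
        )
    return problems


# Approximate worst-case figures from the text: ~10 TB disk, ~370 GB memory.
for problem in check_resources(".", need_disk_gib=10000, need_mem_gib=370):
    print("WARNING:", problem)
```

In practice, `path` would be the directory configured as `log_path`, since that is where the intermediate files are written.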
The input file is a YAML file, `Config.yaml`, in the root folder, where all options are described:
A. Basic/required parameters setting
- Please replace every value after the ":" in section A with your own.
option | action | values ([default]) |
---|---|---|
dataset_location | Where the PATRIC data will be downloaded (~246 GB). | /vol/projects/BIFO/patric_genome |
output_path | Where the Results folder is generated, holding the direct results of each software and the further visualization. | ./ |
log_path | Where the log folder for intermediate files is generated (~10 TB, assuming files of completed benchmarking species are cleaned regularly). | ./ |
n_jobs | Number of CPU cores (>1) to use. | 10 |
gpu_on | Whether to use a GPU for the Aytan-Aktug SSSA model. If set to False, the model is parallelized on CPUs; otherwise, it runs sequentially on one GPU core. | False |
clean_software | Clean large intermediate files of the specified software (optional). Large temporary files can also be removed manually from `<log_path>/log/software/<software_name>/software_output`. | |
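
As a rough illustration of the constraints in the table above, the sketch below validates the section A options after they have been loaded into a Python dictionary. It is not code from the repository: `validate_basic_options` is a hypothetical helper, and the flat key layout is an assumption about how `Config.yaml` is parsed.

```python
def validate_basic_options(cfg):
    """Return a list of error messages for the section A options.

    Assumption: `cfg` is a flat dict keyed by the option names in the table.
    """
    errors = []
    for key in ("dataset_location", "output_path", "log_path"):
        value = cfg.get(key)
        if not isinstance(value, str) or not value:
            errors.append(key + " must be a non-empty path string")
    n_jobs = cfg.get("n_jobs")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n_jobs, bool) or not isinstance(n_jobs, int) or n_jobs <= 1:
        errors.append("n_jobs must be an integer greater than 1")
    if not isinstance(cfg.get("gpu_on"), bool):
        errors.append("gpu_on must be True or False")
    return errors


# Default values taken from the table above.
example = {
    "dataset_location": "/vol/projects/BIFO/patric_genome",
    "output_path": "./",
    "log_path": "./",
    "n_jobs": 10,
    "gpu_on": False,
}
print(validate_basic_options(example))
```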
B. Optional parameters setting
- Please change the conda environment names if environments with the same names already exist on your machine.
option | action | values ([default]) |
---|---|---|
amr_env_name,amr_env_name2 | conda env for general use | amr_env,amr2 |
PhenotypeSeeker_env_name | conda env for PhenotypeSeeker | PhenotypeSeeker_env |
multi_env_name | conda env for | multi_env |
multi_torch_env_name | conda env for NN model | multi_torch_env |
kover_env_name | conda env for Kover | kover_env |
se2ge_env_name | conda env for Seq2Geno | snakemake_env |
kmer_env_name | conda env for Seq2Geno k-mers generation | kmer_kmc |
phylo_name | conda env for Seq2Geno phylogenetic trees generation | phylo_env |
phylo_name2 | conda env for visualization of misclassified genomes | phylo_env2 |
resfinder_env | conda env for ResFinder | res_env |
C. Advanced/optional parameters setting
- You can evaluate a subset of species at a time by modifying the values of the 'species_list', 'species_list_phylotree', and 'species_list_multi_antibiotics' options.
- For multi-species models, we have listed all species for which this study provides datasets; you can explore new combinations of the listed species as you like. Users who want to reproduce the results of this AMR benchmarking study are advised not to change the settings in this category.
option | action | values ([default]) |
---|---|---|
species_list | Benchmarked species under random and homology-aware folds for single-species evaluation | Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Mycobacterium_tuberculosis, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae |
species_list_phylotree | Benchmarked species under phylogeny-aware folds for single-species evaluation | Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae |
species_list_multi_antibiotics | Benchmarked species for single-species multi-antibiotic model. | Mycobacterium_tuberculosis, Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Neisseria_gonorrhoeae |
species_list_multi_species | Benchmarked species for multi-species models. | Mycobacterium_tuberculosis, Salmonella_enterica, Streptococcus_pneumoniae, Escherichia_coli, Staphylococcus_aureus, Klebsiella_pneumoniae, Acinetobacter_baumannii, Pseudomonas_aeruginosa, Campylobacter_jejuni |
cv_number | The k value of k-fold nested cross-validation | 10 |
QC_criteria | Sample quality control level. Can be loose or strict. | loose |
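
When editing the species options, it can be useful to check a chosen subset against the defaults listed above. The sketch below copies two of the default lists from the table; the helper `unknown_species` is hypothetical and not part of the repository.

```python
# Default values copied from the table above.
SPECIES_LIST = [
    "Escherichia_coli", "Staphylococcus_aureus", "Salmonella_enterica",
    "Klebsiella_pneumoniae", "Pseudomonas_aeruginosa",
    "Acinetobacter_baumannii", "Streptococcus_pneumoniae",
    "Mycobacterium_tuberculosis", "Campylobacter_jejuni",
    "Enterococcus_faecium", "Neisseria_gonorrhoeae",
]
SPECIES_LIST_PHYLOTREE = [
    "Escherichia_coli", "Staphylococcus_aureus", "Salmonella_enterica",
    "Klebsiella_pneumoniae", "Pseudomonas_aeruginosa",
    "Acinetobacter_baumannii", "Streptococcus_pneumoniae",
    "Campylobacter_jejuni", "Enterococcus_faecium", "Neisseria_gonorrhoeae",
]


def unknown_species(requested, allowed):
    """Return the requested names that do not appear in `allowed`, in order."""
    allowed_set = set(allowed)
    return [name for name in requested if name not in allowed_set]


# A subset for a quick run; every name must come from the default list.
subset = ["Escherichia_coli", "Campylobacter_jejuni"]
print(unknown_species(subset, SPECIES_LIST))

# Species in the default single-species list but not in the phylogeny-aware list.
print(unknown_species(SPECIES_LIST, SPECIES_LIST_PHYLOTREE))
```

With the defaults as listed, the second call reports only `Mycobacterium_tuberculosis`, which appears in `species_list` but not in `species_list_phylotree`.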
```
└── Results
    ├── final_figures_tables
    ├── other_figures_tables
    ├── supplement_figures_tables
    └── software
        ├── AytanAktug
        ├── kover
        ├── majority
        ├── phenotypeseeker
        ├── resfinder_b
        ├── resfinder_folds
        ├── resfinder_k
        └── seq2geno
```
- Cross-validation results of each ML software and evaluation results of ResFinder are generated under `output_path/Results/software/<name of the software>`.
- Visualization tables and graphs are generated under `output_path/Results/final_figures_tables` and `output_path/Results/supplement_figures_tables`.
- Numbers and statistical results mentioned in our benchmarking article are generated under `output_path/Results/other_figures_tables`.
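
The output locations described above can be assembled with `pathlib`. This is only a sketch of the layout: `result_dirs` is a hypothetical helper, and the folder names are taken from the directory tree shown earlier.

```python
from pathlib import Path


def result_dirs(output_path, software):
    """Map a short label to each output folder under <output_path>/Results."""
    results = Path(output_path) / "Results"
    return {
        "software": results / "software" / software,
        "final": results / "final_figures_tables",
        "supplement": results / "supplement_figures_tables",
        "other": results / "other_figures_tables",
    }


# "./" is the default output_path; "kover" is one of the software folders.
for label, folder in result_dirs("./", "kover").items():
    print(label, folder.as_posix())
```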
```
git clone https://github.com/hzi-bifo/AMR_benchmarking.git
cd AMR_benchmarking
bash main.sh  # Usage details are explained in main.sh. A single run of this command does not complete the whole AMR benchmarking.
bash ./scripts/model/clean.sh  # Optional. Clean intermediate files
```
- See `main.sh` for the benchmarking workflow.
- Use `clean.sh` to remove large, less important intermediate files. You can run it at any time after the specified software has finished running on a benchmarked species. Do not run it while that software is running on a new benchmarked species.
[1] D Aytan-Aktug, Philip Thomas Lanken Conradsen Clausen, Valeria Bortolaia, Frank Møller Aarestrup, and Ole Lund. Prediction of acquired antimicrobial resistance for multiple bacterial species using neural networks. mSystems, 5(1), 2020.
[2] Ariane Khaledi, Aaron Weimann, Monika Schniederjans, Ehsaneddin Asgari, Tzu-Hao Kuo, Antonio Oliver, Gabriel Cabot, Axel Kola, Petra Gastmeier, Michael Hogardt, et al. Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics. EMBO Molecular Medicine, 12(3):e10264, 2020.
[3] Erki Aun, Age Brauer, Veljo Kisand, Tanel Tenson, and Maido Remm. A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria. PLoS Computational Biology, 14(10):e1006434, 2018.
[4] Alexandre Drouin, Gaël Letarte, Frédéric Raymond, Mario Marchand, Jacques Corbeil, and François Laviolette. Interpretable genotype-to-phenotype classifiers with performance guarantees. Scientific Reports, 9(1):1–13, 2019.
[5] Valeria Bortolaia, Rolf S Kaas, Etienne Ruppe, Marilyn C Roberts, Stefan Schwarz, Vincent Cattoir, Alain Philippon, Rosa L Allesoe, Ana Rita Rebelo, Alfred Ferrer Florensa, et al. ResFinder 4.0 for predictions of phenotypes from genotypes. Journal of Antimicrobial Chemotherapy, 75(12):3491–3500, 2020.
MIT License
- Open an issue in the repository.
- Send an email to Kaixin Hu ([email protected]).