The Data Source Project

Welcome to the computer-assisted chemical synthesis data source project!

Over the last decade, computer-assisted chemical synthesis has re-emerged as a heavily researched subject in Chemoinformatics. Even though the idea of utilizing computers to assist chemical synthesis has existed for nearly as long as computers themselves, the expected blend of reliability and innovation has repeatedly proven difficult to achieve. Nevertheless, recent machine learning approaches have exhibited the potential to address these shortcomings. The open-source data utilized by such approaches frequently lack quality and quantity, are stored in various formats, or are published behind paywalls, all of which can be significant barriers to entry, especially for novice researchers. Consequently, the main objective of this project is to systematically curate and facilitate access to relevant computer-assisted chemical synthesis data sources.

Installation

A standalone environment can be created using the git and conda commands as follows:

git clone https://github.com/neo-chem-synth-wave/data-source.git

cd data-source

conda env create -f environment.yaml

conda activate data-source-env

The data_source package can be installed using the pip command as follows:

pip install --no-build-isolation -e .

Utilization

The purpose of the scripts directory is to illustrate how to download, extract, and format the following types of computer-assisted chemical synthesis data: chemical compounds, chemical reactions, and chemical reaction rules.

The download_extract_and_format_data script can be utilized as follows:

# Example #1: Get the chemical reaction rule data source name information.
python scripts/download_extract_and_format_data.py \
  --data_source_category "reaction_rule" \
  --get_data_source_name_information

# Example #2: Get the ZINC chemical compound database version information.
python scripts/download_extract_and_format_data.py \
  --data_source_category "compound" \
  --data_source_name "zinc" \
  --get_data_source_version_information

# Example #3: Download, extract, and format the data from the USPTO (50k) chemical reaction dataset.
python scripts/download_extract_and_format_data.py \
  --data_source_category "reaction" \
  --data_source_name "uspto" \
  --data_source_version "v_50k_by_20171116_coley_c_w_et_al" \
  --output_directory_path "path/to/the/output/directory"
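The same invocations can also be assembled from Python, for example when iterating over several data sources. The following is a minimal sketch and not part of the repository: the helper name `build_command` is hypothetical, and it uses only the flags that appear in the three examples above.

```python
import sys
from typing import List, Optional


def build_command(
    data_source_category: str,
    data_source_name: Optional[str] = None,
    data_source_version: Optional[str] = None,
    output_directory_path: Optional[str] = None,
    script_path: str = "scripts/download_extract_and_format_data.py",
) -> List[str]:
    """Assemble the argument list for the script (hypothetical helper)."""
    command = [
        sys.executable,
        script_path,
        "--data_source_category",
        data_source_category,
    ]

    if data_source_name is None:
        # Mirrors Example #1: list the data source names of a category.
        command.append("--get_data_source_name_information")
    elif data_source_version is None:
        # Mirrors Example #2: list the versions of a named data source.
        command += [
            "--data_source_name",
            data_source_name,
            "--get_data_source_version_information",
        ]
    else:
        # Mirrors Example #3: download, extract, and format one version.
        if output_directory_path is None:
            raise ValueError("An output directory path is required.")
        command += [
            "--data_source_name",
            data_source_name,
            "--data_source_version",
            data_source_version,
            "--output_directory_path",
            output_directory_path,
        ]

    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root once the environment described under Installation is active.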

Chemical Compounds

The following chemical compound data sources are supported:

chemical_compound_data_sources.png

ChEMBL

The following ChEMBL chemical compound database versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_release_{release_number ≥ 25} [1] | 10.6019/CHEMBL.database.{release_number} | 🟢 |

🟢 Completely Implemented
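The `{release_number}` placeholder appears in both the version identifier and the DOI. As a rough illustration, the sketch below fills in both for a given release, assuming the placeholder is replaced literally by the integer release number. The function name is hypothetical and is not the data_source package API.

```python
from typing import Tuple

# Lower bound stated for the supported ChEMBL releases.
MINIMUM_CHEMBL_RELEASE_NUMBER = 25


def format_chembl_version_and_doi(release_number: int) -> Tuple[str, str]:
    """Return the (version, DOI) pair for a ChEMBL release (hypothetical helper)."""
    if release_number < MINIMUM_CHEMBL_RELEASE_NUMBER:
        raise ValueError(
            f"Release {release_number} is below the minimum supported "
            f"release {MINIMUM_CHEMBL_RELEASE_NUMBER}."
        )

    # Assumption: the placeholder is substituted by the plain integer.
    version = f"v_release_{release_number}"
    doi = f"10.6019/CHEMBL.database.{release_number}"

    return version, doi
```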

ZINC

The following ZINC chemical compound database versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_building_blocks_{building_block_subset_name} [2] | 10.1021/acs.jcim.0c00675 | 🟢 |
| v_catalog_{catalog_name} [2] | 10.1021/acs.jcim.0c00675 | 🟢 |
| v_moses_by_20201218_polykovskiy_d_et_al [3] | 10.3389/fphar.2020.565644 | 🟢 |

🟢 Completely Implemented

Chemical Reactions

The following chemical reaction data sources are supported:

chemical_reaction_data_sources.png

United States Patent and Trademark Office (USPTO)

The following United States Patent and Trademark Office (USPTO) chemical reaction dataset versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_1976_to_2013_rsmi_by_20121009_lowe_d_m [4] | 10.6084/m9.figshare.12084729.v1 | 🟢 |
| v_50k_by_20141226_schneider_n_et_al [5] | 10.1021/ci5006614 | 🟢 |
| v_50k_by_20161122_schneider_n_et_al [6] | 10.1021/acs.jcim.6b00564 | 🟢 |
| v_15k_by_20170418_coley_c_w_et_al [7] | 10.1021/acscentsci.7b00064 | 🟢 |
| v_1976_to_2016_cml_by_20121009_lowe_d_m [4] | 10.6084/m9.figshare.5104873.v1 | 🟡 |
| v_1976_to_2016_rsmi_by_20121009_lowe_d_m [4] | 10.6084/m9.figshare.5104873.v1 | 🟢 |
| v_50k_by_20170905_liu_b_et_al [8] | 10.1021/acscentsci.7b00303 | 🟢 |
| v_50k_by_20171116_coley_c_w_et_al [9] | 10.1021/acscentsci.7b00355 | 🟢 |
| v_480k_or_mit_by_20171204_jin_w_et_al [10] | 10.48550/arXiv.1709.04555 | 🟢 |
| v_480k_or_mit_by_20180622_schwaller_p_et_al [11] | 10.1039/C8SC02339E | 🟢 |
| v_stereo_by_20180622_schwaller_p_et_al [11] | 10.1039/C8SC02339E | 🟢 |
| v_lef_by_20181221_bradshaw_j_et_al [12] | 10.48550/arXiv.1805.10970 | 🟢 |
| v_1k_tpl_by_20210128_schwaller_p_et_al [13] | 10.1038/s42256-020-00284-w | 🟢 |
| v_1976_to_2016_remapped_by_20210407_schwaller_p_et_al [14] | 10.1126/sciadv.abe4166 | 🟢 |
| v_1976_to_2016_remapped_by_20240313_chen_s_et_al [15] | 10.6084/m9.figshare.25046471.v1 | 🟢 |
| v_50k_remapped_by_20240313_chen_s_et_al [15] | 10.6084/m9.figshare.25046471.v1 | 🟢 |
| v_mech_31k_by_20240810_chen_s_et_al [16] | 10.6084/m9.figshare.24797220.v2 | 🟢 |

🟢 Completely Implemented
🟡 Partially Implemented (Limited to Reaction SMILES Strings)
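Most of the USPTO version identifiers appear to follow the pattern `v_{description}_by_{YYYYMMDD}_{authors}`. This convention is inferred from the identifiers listed here and is not documented behavior of the package. Under that assumption, the sketch below splits an identifier into its parts, using the hypothetical names `ParsedVersion` and `parse_version`. Identifiers that do not match, such as `v_release_main`, yield `None`.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class ParsedVersion(NamedTuple):
    """Parts of a version identifier (hypothetical container)."""

    description: str
    dated: date
    authors: str


# Assumed convention: v_{description}_by_{YYYYMMDD}_{authors}
_VERSION_PATTERN = re.compile(
    r"^v_(?P<description>.+)_by_(?P<date>\d{8})_(?P<authors>.+)$"
)


def parse_version(version: str) -> Optional[ParsedVersion]:
    """Split a version identifier, or return None if it does not match."""
    match = _VERSION_PATTERN.match(version)

    if match is None:
        return None

    try:
        # Reject eight-digit runs that are not valid calendar dates.
        dated = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        return None

    return ParsedVersion(
        description=match.group("description"),
        dated=dated,
        authors=match.group("authors"),
    )
```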

Open Reaction Database (ORD)

The following Open Reaction Database (ORD) versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_release_0_1_0 [17] | 10.1021/jacs.1c09820 | 🟡 |
| v_release_main [17] | 10.1021/jacs.1c09820 | 🟡 |
| v_orderly_condition_by_20240422_wigh_d_s_et_al [18] | 10.6084/m9.figshare.23298467.v4 | 🟢 |
| v_orderly_forward_by_20240422_wigh_d_s_et_al [18] | 10.6084/m9.figshare.23298467.v4 | 🟢 |
| v_orderly_retro_by_20240422_wigh_d_s_et_al [18] | 10.6084/m9.figshare.23298467.v4 | 🟢 |

🟢 Completely Implemented
🟡 Partially Implemented (Limited to Reaction SMILES Strings)

Chemical Reaction Database (CRD)

The following Chemical Reaction Database (CRD) versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_reaction_smiles_2001_to_2021 [19] | 10.6084/m9.figshare.20279733.v1 | 🟢 |
| v_reaction_smiles_2001_to_2023 [19] | 10.6084/m9.figshare.22491730.v1 | 🟢 |
| v_reaction_smiles_2023 [19] | 10.6084/m9.figshare.24921555.v1 | 🟢 |

🟢 Completely Implemented
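The DOI columns throughout this document list bare DOIs. A bare DOI can be turned into a resolvable link by prefixing it with the `https://doi.org/` resolver address. The sketch below is one way to do that and is not part of the repository. The function name `doi_to_url` is hypothetical.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Return the doi.org resolver URL for a bare DOI (hypothetical helper)."""
    # Keep the slash between prefix and suffix; percent-encode anything unsafe.
    return "https://doi.org/" + quote(doi.strip(), safe="/")


print(doi_to_url("10.6084/m9.figshare.20279733.v1"))
```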

Rhea

The following Rhea chemical reaction database versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_release_{release_number ≥ 126} [20] | 10.1093/nar/gkab1016 | 🟢 |

🟢 Completely Implemented

Miscellaneous Chemical Reaction Data Sources

The following miscellaneous chemical reaction data sources are supported:

| Version | DOI | Status |
|---|---|---|
| v_20131008_kraut_h_et_al [21] | 10.1021/ci400442f | 🟢 |
| v_20161014_wei_j_n_et_al [22] | 10.1021/acscentsci.6b00219 | 🟢 |
| v_20200508_grambow_c_et_al [23] | 10.5281/zenodo.3581266 | 🟢 |
| v_add_on_by_20200508_grambow_c_et_al [23] | 10.5281/zenodo.3731553 | 🟢 |
| v_golden_dataset_by_20211103_lin_a_et_al [24] | 10.1002/minf.202100138 | 🟢 |
| v_rdb7_by_20220718_spiekermann_k_et_al [25] | 10.5281/zenodo.5652097 | 🟢 |

🟢 Completely Implemented

Chemical Reaction Rules

The following chemical reaction rule data sources are supported:

chemical_reaction_rule_data_sources.png

RetroRules

The following RetroRules chemical reaction rule database versions are supported:

| Version | DOI | Status |
|---|---|---|
| v_release_rr01_rp2_hs [26] | 10.5281/zenodo.5827427 | 🟢 |
| v_release_rr02_rp2_hs [26] | 10.5281/zenodo.5828017 | 🟢 |
| v_release_rr02_rp3_hs [26] | 10.5281/zenodo.5827977 | 🟢 |
| v_release_rr02_rp3_nohs [26] | 10.5281/zenodo.5827969 | 🟢 |

🟢 Completely Implemented

Miscellaneous Chemical Reaction Rule Data Sources

The following miscellaneous chemical reaction rule data sources are supported:

| Version | DOI | Status |
|---|---|---|
| v_retro_transform_db_by_20180421_avramova_s_et_al [27] | 10.5281/zenodo.1209312 | 🟢 |
| v_dingos_by_20190701_button_a_et_al [28] | 10.24433/CO.6930970.v1 | 🟢 |

🟢 Completely Implemented

Data

The purpose of the data directory is to archive the data sources that are hosted on GitHub and CodeOcean repositories.

License Information

The contents of this repository are published under the MIT license. Please refer to individual references for more details regarding the license information of external resources utilized within this repository.

Contact

If you are interested in contributing to this repository by reporting bugs, suggesting improvements, or submitting feedback, feel free to use GitHub Issues.

References

[1] Zdrazil, B., Felix, E., Hunter, F., Manners, E.J., Blackshaw, J., Corbett, S., de Veij, M., Ioannidis, H., Lopez, D.M., Mosquera, J.F., Magarinos, M.P., Bosc, N., Arcila, R., Kizilören, T., Gaulton, A., Bento, A.P., Adasme, M.F., Monecke, P., Landrum, G.A., and Leach, A.R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Research, 52, D1, 2024, D1180-D1192.

[2] Irwin, J.J., Tang, K.G., Young, J., Dandarchuluun, C., Wong, B.R., Khurelbaatar, M., Moroz, Y.S., Mayfield, J., and Sayle, R.A. ZINC20 - A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model., 2020, 60, 12, 6065-6073.

[3] Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol., 11, 2020.

[4] Lowe, D.M. Extraction of Chemical Structures and Reactions from the Literature. Ph.D. Thesis, University of Cambridge, Department of Chemistry, Pembroke College, 2012.

[5] Schneider, N., Lowe, D.M., Sayle, R.A., and Landrum, G.A. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-scale Reaction Classification and Similarity. J. Chem. Inf. Model., 2015, 55, 1, 39–53.

[6] Schneider, N., Stiefl, N., and Landrum, G.A. What's What: The (Nearly) Definitive Guide to Reaction Role Assignment. J. Chem. Inf. Model., 2016, 56, 12, 2336–2346.

[7] Coley, C.W., Barzilay, R., Jaakkola, T.S., Green, W.H., and Jensen, K.F. Prediction of Organic Reaction Outcomes using Machine Learning. ACS Cent. Sci., 2017, 3, 5, 434–443.

[8] Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Nguyen, Q.L., Ho, S., Sloane, J., Wender, P., and Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-sequence Models. ACS Cent. Sci., 2017, 3, 10, 1103-1113.

[9] Coley, C.W., Rogers, L., Green, W.H., and Jensen, K.F. Computer-assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci., 2017, 3, 12, 1237–1245.

[10] Jin, W., Coley, C.W., Barzilay, R., and Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems, 30, 2017.

[11] Schwaller, P., Gaudin, T., Lányi, D., Bekas, C., and Laino, T. "Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-sequence Models. Chem. Sci., 2018, 9, 6091-6098.

[12] Bradshaw, J., Kusner, M.J., Paige, B., Segler, M.H.S., and Hernández-Lobato, M.J. A Generative Model for Electron Paths. International Conference on Learning Representations, 2019.

[13] Schwaller, P., Probst, D., Vaucher, A.C., Nair, V.H., Kreutter, D., Laino, T., and Reymond, J. Mapping the Space of Chemical Reactions using Attention-based Neural Networks. Nat. Mach. Intell., 3, 144-152, 2021.

[14] Schwaller, P., Hoover, B., Reymond, J., Strobelt, H., and Laino, T. Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions. Sci. Adv., eabe4166, 2021.

[15] Chen, S., An, S., Babazade, R., and Jung, Y. Precise Atom-to-atom Mapping for Organic Reactions via Human-in-the-loop Machine Learning. Nat. Commun., 15, 2250, 2024.

[16] Chen, S., Babazade, R., Kim, T., Han, S., and Jung, Y. A Large-scale Reaction Dataset of Mechanistic Pathways of Organic Reactions. Sci. Data, 11, 863, 2024.

[17] Kearnes, S.M., Maser, M.R., Wleklinski, M., Kast, A., Doyle, A.G., Dreher, S.D., Hawkins, J.M., Jensen, K.F., and Coley, C.W. The Open Reaction Database. J. Am. Chem. Soc., 2021, 143, 45, 18820–18826.

[18] Wigh, D.S., Arrowsmith, J., Pomberger, A., Felton, K.C., and Lapkin, A.A. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J. Chem. Inf. Model., 2024, 64, 9, 3790–3798.

[19] The Chemical Reaction Database (CRD): https://kmt.vander-lingen.nl. Accessed on: August 25th, 2024.

[20] Bansal, P., Morgat, A., Axelsen, K.B., Muthukrishnan, V., Coudert, E., Aimo, L., Hyka-Nouspikel, N., Gasteiger, E., Kerhornou, A., Neto, T.B., Pozzato, M., Blatter, M., Ignatchenko, A., Redaschi, N., and Bridge, A. Rhea, the Reaction Knowledgebase in 2022. Nucleic Acids Research, 50, D1, 2022, D693–D700.

[21] Kraut, H., Eiblmaier, J., Grethe, G., Löw, P., Matuszczyk, H., and Saller, H. Algorithm for Reaction Classification. J. Chem. Inf. Model., 2013, 53, 11, 2884–2895.

[22] Wei, J.N., Duvenaud, D., and Aspuru-Guzik, A. Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Cent. Sci., 2016, 2, 10, 725–732.

[23] Grambow, C.A., Pattanaik, L., and Green, W.H. Reactants, Products, and Transition States of Elementary Chemical Reactions based on Quantum Chemistry. Sci. Data, 7, 137, 2020.

[24] Lin, A., Dyubankova, N., Madzhidov, T.I., Nugmanov, R.I., Verhoeven, J., Gimadiev, T.R., Afonina, V.A., Ibragimova, Z., Rakhimbekova, A., Sidorov, P., Gedich, A., Suleymanov, R., Mukhametgaleev, R., Wegner, J., Ceulemans, H., and Varnek, A. Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies. Mol. Inf., 2022, 41, 2100138.

[25] Spiekermann, K., Pattanaik, L., and Green, W.H. High Accuracy Barrier Heights, Enthalpies, and Rate Coefficients for Chemical Reactions. Sci. Data, 9, 417, 2022.

[26] Duigou, T., du Lac, M., Carbonell, P., and Faulon, J. RetroRules: A Database of Reaction Rules for Engineering Biology. Nucleic Acids Research, 47, D1, 2019, D1229–D1235.

[27] Avramova, S., Kochev, N., and Angelov, P. RetroTransformDB: A Dataset of Generic Transforms for Retrosynthetic Analysis. Data, 2018, 3, 14.

[28] Button, A., Merk, D., Hiss, J.A., and Schneider, G. Automated De Novo Molecular Design by Hybrid Machine Intelligence and Rule-driven Chemical Synthesis. Nat. Mach. Intell., 1, 307-315, 2019.