Welcome to the computer-assisted chemical synthesis data source research project!
Over the last decade, computer-assisted chemical synthesis has re-emerged as a heavily researched subject in Chemoinformatics. Even though the idea of using computers to assist chemical synthesis has existed for nearly as long as computers themselves, the expected blend of reliability and innovation has repeatedly proven difficult to achieve. Nevertheless, recent machine learning approaches have shown the potential to address these shortcomings. The data used by such approaches frequently lack quality and quantity, are stored in various formats, or are published behind paywalls, all of which can be significant barriers to entry, especially for novice researchers. Consequently, the main objective of this research project is to systematically curate and facilitate access to relevant open computer-assisted chemical synthesis data sources.
A standalone environment can be created using the git and conda commands as follows:
git clone https://github.com/neo-chem-synth-wave/data-source.git
cd data-source
conda env create -f environment.yaml
conda activate data-source-env
The data_source package can be installed using the pip command as follows:
pip install --no-build-isolation -e .
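After the installation, one quick way to confirm that the package is visible to the active environment is a standard-library lookup. The snippet below is a minimal sketch; it assumes that the importable module name is `data_source`, matching the package name mentioned above, and the helper name `is_package_installed` is purely illustrative.

```python
from importlib.util import find_spec


def is_package_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located in the active environment."""
    # find_spec returns None when no top-level module with this name is found.
    return find_spec(package_name) is not None


# Assumption: the import name matches the package name "data_source".
print(is_package_installed("data_source"))
```

If this prints `False`, the most likely cause is that the conda environment created above is not the active one.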
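The purpose of the scripts directory is to illustrate how to download, extract, and format the following categories of computer-assisted chemical synthesis data:
- Chemical Compounds
- Chemical Compound Patterns
- Chemical Reactions
- Chemical Reaction Patterns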
The download_extract_and_format_data script can be utilized as follows:
# Use Case #1: Get the chemical reaction data source name information.
python scripts/download_extract_and_format_data.py \
--data_source_category "reaction" \
--get_data_source_name_information
# Use Case #2: Get the USPTO chemical reaction dataset version information.
python scripts/download_extract_and_format_data.py \
--data_source_category "reaction" \
--data_source_name "uspto" \
--get_data_source_version_information
# Use Case #3: Download, extract, and format the data from the USPTO chemical reaction dataset.
python scripts/download_extract_and_format_data.py \
--data_source_category "reaction" \
--data_source_name "uspto" \
--data_source_version "v_50k_by_20171116_coley_c_w_et_al" \
--output_directory_path "path/to/the/output/directory"
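The three use cases above form a natural progression: list the data source names, then list the versions, then download a specific version. The following is a minimal sketch of how the same calls could be assembled from Python. It uses only the flags shown above; the helper names `build_command` and `run` are hypothetical and not part of the repository, and the script path assumes the current working directory is the repository root.

```python
import subprocess
import sys
from typing import List, Optional

# Path of the script relative to the repository root, as used in the commands above.
SCRIPT_PATH = "scripts/download_extract_and_format_data.py"


def build_command(
    data_source_category: str,
    data_source_name: Optional[str] = None,
    data_source_version: Optional[str] = None,
    output_directory_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the script from the flags shown above."""
    command = [sys.executable, SCRIPT_PATH, "--data_source_category", data_source_category]

    if data_source_name is None:
        # No data source chosen yet: request the available names (Use Case #1).
        return command + ["--get_data_source_name_information"]

    command += ["--data_source_name", data_source_name]

    if data_source_version is None:
        # No version chosen yet: request the available versions (Use Case #2).
        return command + ["--get_data_source_version_information"]

    # Name and version are both known: download, extract, and format (Use Case #3).
    command += ["--data_source_version", data_source_version]

    if output_directory_path is not None:
        command += ["--output_directory_path", output_directory_path]

    return command


def run(command: List[str]) -> None:
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

For example, `run(build_command("reaction", "uspto", "v_50k_by_20171116_coley_c_w_et_al", "path/to/the/output/directory"))` would correspond to Use Case #3. Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.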
The following chemical compound data sources are supported:
- ZINC
- ChEMBL
- COCONUT
- Miscellaneous Chemical Compound Data Sources
The chemical compound data source relationships can be illustrated as follows:
The following ZINC [1, 2, 3] chemical compound database versions are supported:
Version | DOI | Status |
---|---|---|
v_building_blocks_{building_block_subset_name} [2] | 10.1021/acs.jcim.0c00675 | 🟢 |
v_catalog_{catalog_name} [2] | 10.1021/acs.jcim.0c00675 | 🟢 |

🟢 Completely Implemented
The following ChEMBL [4] chemical compound database versions are supported:
Version | DOI | Status |
---|---|---|
v_release_{release_number ≥ 25} [4] | 10.6019/CHEMBL.database.{release_number} | 🟢 |

🟢 Completely Implemented
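The version and DOI entries in the table above are templates parameterized by the release number, with 25 as the lowest supported release. A minimal sketch of filling both templates is shown below. The function name `chembl_version_and_doi` is hypothetical, and the sketch assumes that release numbers are plain integers.

```python
from typing import Tuple

# Lower bound on the release number, as stated in the table above.
MINIMUM_CHEMBL_RELEASE_NUMBER = 25


def chembl_version_and_doi(release_number: int) -> Tuple[str, str]:
    """Fill the version and DOI templates from the table for one ChEMBL release."""
    if release_number < MINIMUM_CHEMBL_RELEASE_NUMBER:
        raise ValueError(
            f"ChEMBL release {release_number} is below the minimum supported "
            f"release {MINIMUM_CHEMBL_RELEASE_NUMBER}."
        )

    version = f"v_release_{release_number}"
    doi = f"10.6019/CHEMBL.database.{release_number}"

    return version, doi


print(chembl_version_and_doi(33))
```

The resulting version string is in the form expected by the `--data_source_version` argument of the script described earlier.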
The following COCONUT [5, 6] chemical compound database versions are supported:
Version | DOI | Status |
---|---|---|
v_2_0_by_20241126_chandrasekhar_v_et_al [6] | 10.5281/zenodo.13382750 | 🟢 |
v_2_0_complete_by_20241126_chandrasekhar_v_et_al [6] | 10.5281/zenodo.13382750 | 🟢 |

🟢 Completely Implemented
The following miscellaneous chemical compound data sources are supported:
Version | DOI | Status |
---|---|---|
v_moses_by_20201218_polykovskiy_d_et_al [7] | 10.3389/fphar.2020.565644 | 🟢 |

🟢 Completely Implemented
The following chemical compound pattern data sources are supported:
- RDKit
The chemical compound pattern data source relationships can be illustrated as follows:
The following RDKit [8] chemical compound pattern dataset versions are supported:
Version | DOI | Status |
---|---|---|
v_brenk_by_20080307_brenk_r_et_al [9] | 10.1002/cmdc.200700139 | 🟢 |
v_pains_by_20100204_baell_j_b_and_holloway_g_a [10] | 10.1021/jm901137j | 🟢 |

🟢 Completely Implemented
The following chemical reaction data sources are supported:
- United States Patent and Trademark Office (USPTO)
- Open Reaction Database (ORD)
- Chemical Reaction Database (CRD)
- Rhea
- Miscellaneous Chemical Reaction Data Sources
The chemical reaction data source relationships can be illustrated as follows:
The following United States Patent and Trademark Office (USPTO) [11] chemical reaction dataset versions are supported:
Version | DOI | Status |
---|---|---|
v_1976_to_2013_rsmi_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.12084729.v1 | 🟢 |
v_50k_by_20141226_schneider_n_et_al [12] | 10.1021/ci5006614 | 🟢 |
v_50k_by_20161122_schneider_n_et_al [13] | 10.1021/acs.jcim.6b00564 | 🟢 |
v_15k_by_20170418_coley_c_w_et_al [14] | 10.1021/acscentsci.7b00064 | 🟢 |
v_1976_to_2016_cml_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.5104873.v1 | 🟡 |
v_1976_to_2016_rsmi_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.5104873.v1 | 🟢 |
v_50k_by_20170905_liu_b_et_al [15] | 10.1021/acscentsci.7b00303 | 🟢 |
v_50k_by_20171116_coley_c_w_et_al [16] | 10.1021/acscentsci.7b00355 | 🟢 |
v_480k_or_mit_by_20171204_jin_w_et_al [17] | 10.48550/arXiv.1709.04555 | 🟢 |
v_480k_or_mit_by_20180622_schwaller_p_et_al [18] | 10.1039/C8SC02339E | 🟢 |
v_stereo_by_20180622_schwaller_p_et_al [18] | 10.1039/C8SC02339E | 🟢 |
v_lef_by_20181221_bradshaw_j_et_al [19] | 10.48550/arXiv.1805.10970 | 🟢 |
v_1k_tpl_by_20210128_schwaller_p_et_al [20] | 10.1038/s42256-020-00284-w | 🟢 |
v_1976_to_2016_remapped_by_20210407_schwaller_p_et_al [21] | 10.1126/sciadv.abe4166 | 🟢 |
v_1976_to_2016_remapped_by_20240313_chen_s_et_al [22] | 10.6084/m9.figshare.25046471.v1 | 🟢 |
v_50k_remapped_by_20240313_chen_s_et_al [22] | 10.6084/m9.figshare.25046471.v1 | 🟢 |
v_mech_31k_by_20240810_chen_s_et_al [23] | 10.6084/m9.figshare.24797220.v2 | 🟢 |

🟢 Completely Implemented
🟡 Partially Implemented (Limited to Reaction SMILES Strings)
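The USPTO version identifiers in the table above appear to follow the pattern `v_{label}_by_{YYYYMMDD}_{authors}`. This convention is inferred from the listed names rather than documented, so the following is only a sketch of how such identifiers could be split into their parts. The names `VersionInfo` and `parse_version` are hypothetical, and what exactly the eight-digit date stamp refers to is not stated in the table.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Assumed naming convention: v_{label}_by_{YYYYMMDD}_{authors}
VERSION_PATTERN = re.compile(r"^v_(?P<label>.+)_by_(?P<date>\d{8})_(?P<authors>.+)$")


class VersionInfo(NamedTuple):
    label: str
    date_stamp: date
    authors: str


def parse_version(version: str) -> Optional[VersionInfo]:
    """Split a version identifier into label, date stamp, and author suffix."""
    match = VERSION_PATTERN.match(version)

    if match is None:
        # Identifiers such as "v_release_main" do not follow the assumed convention.
        return None

    return VersionInfo(
        label=match.group("label"),
        date_stamp=datetime.strptime(match.group("date"), "%Y%m%d").date(),
        authors=match.group("authors"),
    )


print(parse_version("v_50k_by_20171116_coley_c_w_et_al"))
```

Parsing the date stamp makes it straightforward to sort versions chronologically, for instance to compare the successive 50k variants.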
The following Open Reaction Database (ORD) [24] versions are supported:
Version | DOI | Status |
---|---|---|
v_release_0_1_0 [24] | 10.1021/jacs.1c09820 | 🟡 |
v_release_main [24] | 10.1021/jacs.1c09820 | 🟡 |

🟢 Completely Implemented
🟡 Partially Implemented (Limited to Reaction SMILES Strings)
The following Chemical Reaction Database (CRD) [25] versions are supported:
Version | DOI | Status |
---|---|---|
v_reaction_smiles_2001_to_2021 [25] | 10.6084/m9.figshare.20279733.v1 | 🟢 |
v_reaction_smiles_2001_to_2023 [25] | 10.6084/m9.figshare.22491730.v1 | 🟢 |
v_reaction_smiles_2023 [25] | 10.6084/m9.figshare.24921555.v1 | 🟢 |
v_reaction_smiles_1976_to_2024 [25] | 10.6084/m9.figshare.28230053.v1 | 🟢 |

🟢 Completely Implemented
The following Rhea [26] chemical reaction database versions are supported:
Version | DOI | Status |
---|---|---|
v_release_{release_number ≥ 126} [26] | 10.1093/nar/gkab1016 | 🟢 |

🟢 Completely Implemented
The following miscellaneous chemical reaction data sources are supported:
Version | DOI | Status |
---|---|---|
v_20131008_kraut_h_et_al [27] | 10.1021/ci400442f | 🟢 |
v_20161014_wei_j_n_et_al [28] | 10.1021/acscentsci.6b00219 | 🟢 |
v_20200508_grambow_c_et_al [29] | 10.5281/zenodo.3581266 | 🟢 |
v_add_on_by_20200508_grambow_c_et_al [29] | 10.5281/zenodo.3731553 | 🟢 |
v_golden_dataset_by_20211103_lin_a_et_al [30] | 10.1002/minf.202100138 | 🟢 |
v_rdb7_by_20220718_spiekermann_k_et_al [31] | 10.5281/zenodo.5652097 | 🟢 |
v_orderly_condition_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | 🟢 |
v_orderly_forward_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | 🟢 |
v_orderly_retro_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | 🟢 |

🟢 Completely Implemented
The following chemical reaction pattern data sources are supported:
- RetroRules
- Miscellaneous Chemical Reaction Pattern Data Sources
The chemical reaction pattern data source relationships can be illustrated as follows:
The following RetroRules [33] chemical reaction pattern database versions are supported:
Version | DOI | Status |
---|---|---|
v_release_rr01_rp2_hs [33] | 10.5281/zenodo.5827427 | 🟢 |
v_release_rr02_rp2_hs [33] | 10.5281/zenodo.5828017 | 🟢 |
v_release_rr02_rp3_hs [33] | 10.5281/zenodo.5827977 | 🟢 |
v_release_rr02_rp3_nohs [33] | 10.5281/zenodo.5827969 | 🟢 |

🟢 Completely Implemented
The following miscellaneous chemical reaction pattern data sources are supported:
Version | DOI | Status |
---|---|---|
v_retro_transform_db_by_20180421_avramova_s_et_al [34] | 10.5281/zenodo.1209312 | 🟢 |
v_dingos_by_20190701_button_a_et_al [35] | 10.24433/CO.6930970.v1 | 🟢 |
v_auto_template_by_20240627_chen_l_and_li_y [36] | 10.1186/s13321-024-00869-2 | 🟢 |

🟢 Completely Implemented
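The DOI column in the tables of this document lists bare identifiers rather than links. A DOI can be turned into a resolvable link by prefixing it with the public `https://doi.org/` resolver, as in the minimal sketch below. The function name `doi_to_url` is hypothetical and not part of the repository.

```python
from urllib.parse import quote

# Public DOI resolver; any DOI appended to this prefix redirects to the hosting page.
DOI_RESOLVER_URL = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI, percent-encoding characters unsafe in a URL path."""
    return DOI_RESOLVER_URL + quote(doi.strip(), safe="/")


print(doi_to_url("10.5281/zenodo.1209312"))
```

This can be useful for checking the license terms of an external data source before downloading it.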
The purpose of the data directory is to archive data sources hosted on GitHub, GitLab, and CodeOcean repositories.
The contents of this repository are published under the MIT license. Please refer to individual references for more details regarding the license information of external resources utilized within this repository.
If you are interested in contributing to this repository by reporting bugs, suggesting improvements, or submitting feedback, feel free to do so using GitHub Issues.
[1] Sterling, T. and Irwin, J.J. ZINC15 – Ligand Discovery for Everyone. J. Chem. Inf. Model., 2015, 55, 11, 2324-2337.
[2] Irwin, J.J., Tang, K.G., Young, J., Dandarchuluun, C., Wong, B.R., Khurelbaatar, M., Moroz, Y.S., Mayfield, J., and Sayle, R.A. ZINC20 - A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model., 2020, 60, 12, 6065-6073.
[3] Tingle, B.I., Tang, K.G., Castanon, M., Gutierrez, J.J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y.S., and Irwin, J.J. ZINC22 - A Free Multi-billion-scale Database of Tangible Compounds for Ligand Discovery. J. Chem. Inf. Model., 2023, 63, 4, 1166-1176.
[4] Zdrazil, B., Felix, E., Hunter, F., Manners, E.J., Blackshaw, J., Corbett, S., de Veij, M., Ioannidis, H., Lopez, D.M., Mosquera, J.F., Magarinos, M.P., Bosc, N., Arcila, R., Kizilören, T., Gaulton, A., Bento, A.P., Adasme, M.F., Monecke, P., Landrum, G.A., and Leach, A.R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Research, 52, D1, 2024, D1180-D1192.
[5] Sorokina, M., Merseburger, P., Rajan, K., Yirik, M.A., and Steinbeck, C. COCONUT Online: Collection of Open Natural Products Database. J. Cheminform., 13, 2, 2021.
[6] Chandrasekhar, V., Rajan, K., Kanakam, S.R.S., Sharma, N., Weißenborn, V., Schaub, J., and Steinbeck, C. COCONUT 2.0: A Comprehensive Overhaul and Curation of the Collection of Open Natural Products Database. Nucleic Acids Research, 53, D1, 2025, D634–D643.
[7] Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol., 11, 2020.
[8] RDKit: Open-source Cheminformatics: https://www.rdkit.org. Accessed on: February 8th, 2025.
[9] Brenk, R., Schipani, A., James, D., Krasowski, A., Gilbert, I.H., Frearson, J., and Wyatt, P.G. Lessons Learnt from Assembling Screening Libraries for Drug Discovery for Neglected Diseases. ChemMedChem, 2008, 3, 435-444.
[10] Baell, J.B. and Holloway, G.A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for their Exclusion in Bioassays. J. Med. Chem., 2010, 53, 7, 2719–2740.
[11] Lowe, D.M. Extraction of Chemical Structures and Reactions from the Literature. Ph.D. Thesis, University of Cambridge, Department of Chemistry, Pembroke College, 2012.
[12] Schneider, N., Lowe, D.M., Sayle, R.A., and Landrum, G.A. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-scale Reaction Classification and Similarity. J. Chem. Inf. Model., 2015, 55, 1, 39–53.
[13] Schneider, N., Stiefl, N., and Landrum, G.A. What’s What: The (Nearly) Definitive Guide to Reaction Role Assignment. J. Chem. Inf. Model., 2016, 56, 12, 2336–2346.
[14] Coley, C.W., Barzilay, R., Jaakkola, T.S., Green, W.H., and Jensen, K.F. Prediction of Organic Reaction Outcomes using Machine Learning. ACS Cent. Sci., 2017, 3, 5, 434–443.
[15] Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Nguyen, Q.L., Ho, S., Sloane, J., Wender, P., and Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-sequence Models. ACS Cent. Sci., 2017, 3, 10, 1103-1113.
[16] Coley, C.W., Rogers, L., Green, W.H., and Jensen, K.F. Computer-assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci., 2017, 3, 12, 1237–1245.
[17] Jin, W., Coley, C.W., Barzilay, R., and Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems, 30, 2017.
[18] Schwaller, P., Gaudin, T., Lányi, D., Bekas, C., and Laino, T. "Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-sequence Models. Chem. Sci., 2018, 9, 6091-6098.
[19] Bradshaw, J., Kusner, M.J., Paige, B., Segler, M.H.S., and Hernández-Lobato, J.M. A Generative Model for Electron Paths. International Conference on Learning Representations, 2019.
[20] Schwaller, P., Probst, D., Vaucher, A.C., Nair, V.H., Kreutter, D., Laino, T., and Reymond, J. Mapping the Space of Chemical Reactions using Attention-based Neural Networks. Nat. Mach. Intell., 3, 144-152, 2021.
[21] Schwaller, P., Hoover, B., Reymond, J., Strobelt, H., and Laino, T. Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions. Sci. Adv., eabe4166, 2021.
[22] Chen, S., An, S., Babazade, R., and Jung, Y. Precise Atom-to-atom Mapping for Organic Reactions via Human-in-the-loop Machine Learning. Nat. Commun., 15, 2250, 2024.
[23] Chen, S., Babazade, R., Kim, T., Han, S., and Jung, Y. A Large-scale Reaction Dataset of Mechanistic Pathways of Organic Reactions. Sci. Data, 11, 863, 2024.
[24] Kearnes, S.M., Maser, M.R., Wleklinski, M., Kast, A., Doyle, A.G., Dreher, S.D., Hawkins, J.M., Jensen, K.F., and Coley, C.W. The Open Reaction Database. J. Am. Chem. Soc., 2021, 143, 45, 18820–18826.
[25] The Chemical Reaction Database (CRD): https://kmt.vander-lingen.nl. Accessed on: February 8th, 2025.
[26] Bansal, P., Morgat, A., Axelsen, K.B., Muthukrishnan, V., Coudert, E., Aimo, L., Hyka-Nouspikel, N., Gasteiger, E., Kerhornou, A., Neto, T.B., Pozzato, M., Blatter, M., Ignatchenko, A., Redaschi, N., and Bridge, A. Rhea, the Reaction Knowledgebase in 2022. Nucleic Acids Research, 50, D1, 2022, D693–D700.
[27] Kraut, H., Eiblmaier, J., Grethe, G., Löw, P., Matuszczyk, H., and Saller, H. Algorithm for Reaction Classification. J. Chem. Inf. Model., 2013, 53, 11, 2884–2895.
[28] Wei, J.N., Duvenaud, D., and Aspuru-Guzik, A. Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Cent. Sci., 2016, 2, 10, 725–732.
[29] Grambow, C.A., Pattanaik, L., and Green, W.H. Reactants, Products, and Transition States of Elementary Chemical Reactions based on Quantum Chemistry. Sci. Data, 7, 137, 2020.
[30] Lin, A., Dyubankova, N., Madzhidov, T.I., Nugmanov, R.I., Verhoeven, J., Gimadiev, T.R., Afonina, V.A., Ibragimova, Z., Rakhimbekova, A., Sidorov, P., Gedich, A., Suleymanov, R., Mukhametgaleev, R., Wegner, J., Ceulemans, H., and Varnek, A. Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies. Mol. Inf., 2022, 41, 2100138.
[31] Spiekermann, K., Pattanaik, L., and Green, W.H. High Accuracy Barrier Heights, Enthalpies, and Rate Coefficients for Chemical Reactions. Sci. Data, 9, 417, 2022.
[32] Wigh, D.S., Arrowsmith, J., Pomberger, A., Felton, K.C., and Lapkin, A.A. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J. Chem. Inf. Model., 2024, 64, 9, 3790–3798.
[33] Duigou, T., du Lac, M., Carbonell, P., and Faulon, J. RetroRules: A Database of Reaction Rules for Engineering Biology. Nucleic Acids Research, 47, D1, 2019, D1229–D1235.
[34] Avramova, S., Kochev, N., and Angelov, P. RetroTransformDB: A Dataset of Generic Transforms for Retrosynthetic Analysis. Data, 2018, 3, 14.
[35] Button, A., Merk, D., Hiss, J.A., and Schneider, G. Automated De Novo Molecular Design by Hybrid Machine Intelligence and Rule-driven Chemical Synthesis. Nat. Mach. Intell., 1, 307-315, 2019.
[36] Chen, L. and Li, Y. AutoTemplate: Enhancing Chemical Reaction Datasets for Machine Learning Applications in Organic Chemistry. J. Cheminform., 16, 74, 2024.