pip install -r requirements.txt
Docker image: gitlab-registry.nrp-nautilus.io/w6xu/guru-docker-image:3699f97c
Internal developer dataset (accessible only to project developers) is available here.
Public dataset, with HSQC spectra removed due to legal concerns, is available here.
Unfortunately, our repo hard-codes the dataset names and file paths. Everyone is more than welcome to reorganize the paths. Otherwise, we suggest the following setup:
- Unzip the dataset zip at /workspace
  unzip -q DatasetWithoutHSQC.zip -d /workspace/
- Git clone our Spectre repo at
  /root/gurusmart/MorganFP_prediction/reproduce_previous_works
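Because the paths are hard-coded, one way to keep the data somewhere else without editing the code is to symlink the expected location to the real one. The sketch below is only an illustration: the helper name `link_expected_path` is hypothetical and not part of the repo, and it assumes a filesystem that supports symlinks.

```python
import os
from pathlib import Path


def link_expected_path(actual: str, expected: str) -> Path:
    """Make `expected` a symlink to `actual` (hypothetical helper, not in the repo)."""
    actual_path = Path(actual).resolve()
    expected_path = Path(expected)
    if not actual_path.exists():
        raise FileNotFoundError(f"source does not exist: {actual_path}")
    # Create any missing parent directories of the hard-coded location.
    expected_path.parent.mkdir(parents=True, exist_ok=True)
    if expected_path.is_symlink():
        expected_path.unlink()  # replace a stale link
    elif expected_path.exists():
        raise FileExistsError(f"refusing to overwrite real path: {expected_path}")
    os.symlink(actual_path, expected_path, target_is_directory=actual_path.is_dir())
    return expected_path
```

For example, if the unzipped data sits in a directory of your choice, you could link that directory to the location under /workspace that the code expects. The helper refuses to replace a real file or directory, so it will not clobber an existing checkout.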
Here OneD_Only_Dataset contains compounds for which only 1D NMRs are available. SMILES_dataset contains compounds for which 2D HSQC NMRs are always available and 1D NMRs are sometimes available.
HYUN_FP refers to Hyunwoo Kim's proposed fingerprint, used in the previous SOTA method, DeepSAT. MW/index.pkl, Chemical/index.pkl, and SMILES/index.pkl record the molecular weight, chemical name, and SMILES string of each compound.
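As an illustration of how the three index files could be read together, here is a minimal sketch. It assumes each index.pkl unpickles to a dict-like object keyed by compound index, which the description above does not confirm; `load_compound_metadata` and its `root` argument are hypothetical names.

```python
import pickle
from pathlib import Path


def load_compound_metadata(root, idx):
    """Look up one compound in MW/, Chemical/, and SMILES/ index.pkl under `root`.

    Assumption: each index.pkl holds a dict-like object keyed by compound index.
    """
    record = {}
    for field in ("MW", "Chemical", "SMILES"):
        with open(Path(root) / field / "index.pkl", "rb") as fh:
            table = pickle.load(fh)
        record[field] = table[idx]
    return record
```

Pickle files can execute arbitrary code when loaded, so only unpickle files obtained from a source you trust.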
(No longer required) Run these notebooks to pre-build pickle files if you want fast access to the molecules that have all three types of NMRs (HSQC, C-NMR, H-NMR):
{repo_path}/Spectre/notebooks/dataset_building/find_all_info_indices.ipynb
{repo_path}/Spectre/notebooks/dataset_building/generate_all_morganFP.ipynb
Analyze the entropy of each fragment in the dataset (this step is not needed if you can download our internal version of the dataset):
bash {repo_path}/Spectre/notebook_and_scripts/SMILES_fragmenting/build_dataset_specific_FP/save_zip.sh
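To give an intuition for what the entropy of a fragment can mean, the sketch below computes the binary Shannon entropy of each fragment's presence or absence across a set of molecules. This is an assumption about the definition, not a description of what save_zip.sh does; the function name and the toy fragment strings are made up for illustration.

```python
import math
from collections import Counter


def fragment_entropies(molecule_fragments):
    """Binary Shannon entropy (in bits) of each fragment's presence across molecules.

    `molecule_fragments` is a list with one collection of fragment identifiers
    per molecule.
    """
    n = len(molecule_fragments)
    # Count in how many molecules each fragment occurs (once per molecule).
    counts = Counter(frag for frags in molecule_fragments for frag in set(frags))
    entropies = {}
    for frag, c in counts.items():
        p = c / n
        if p == 1.0:
            entropies[frag] = 0.0  # present everywhere: carries no information
        else:
            entropies[frag] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return entropies
```

Under this definition, a fragment found in exactly half of the molecules reaches the maximum of 1 bit, while a fragment found in every molecule scores 0.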
Download the pkl file that includes the compounds' SMILES, names, and molecular weights:
- Download the pickle file from Google Drive and save it as
  /root/gurusmart/MorganFP_prediction/inference_data/coconut_loutus_hyun_training/inference_metadata_latest_RDkit.pkl
- Run this script to generate fingerprints of the retrieval sets (by default we use radii from 0 to 6 and build fingerprints of length 16384):
  step=1 python {repo_path}/Spectre/notebook_and_scripts/dataset_building/build_infernce_set_db_specific.py
  step=2 python {repo_path}/Spectre/notebook_and_scripts/dataset_building/build_infernce_set_db_specific.py
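One common way to obtain a fixed-length fingerprint is to hash each fragment identifier to a position in a bit vector. The sketch below is a generic illustration of that idea with the default length of 16384. It is not the logic of build_infernce_set_db_specific.py, which is not shown here; the function name, the use of SHA-256, and the toy fragment strings are all assumptions.

```python
import hashlib


def fold_fragments(fragments, n_bits=16384):
    """Set one bit per fragment string in a fixed-length 0/1 list via a stable hash."""
    bits = [0] * n_bits
    for frag in fragments:
        digest = hashlib.sha256(frag.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer, then fold into range.
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits
```

A cryptographic hash is used here only because it gives the same positions on every run, unlike Python's built-in `hash` for strings. Distinct fragments can still collide on the same bit.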
- datasets:
  - hsqc_folder_dataset.py: A dataset file that loads a given combination of NMRs, such as only HSQC, HSQC and H-NMR, all three NMRs, etc.
  - oneD_dataset.py: A dataset file that loads a given combination of NMRs when all of them are 1D, i.e., only H-NMR, only C-NMR, or both 1D NMRs.
  - optional_2d_folder_dataset: Loads all possible combinations of NMRs; it is used to train models that take optional inputs.
- models:
  - ranked_transformer.py: A Transformer model that takes a specific NMR combination.
  - ranked_resnet.py: A ResNet model that takes a specific NMR combination.
  - optional_input_ranked_transformer.py: A Transformer model that takes any NMR combination.
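The difference between the fixed-combination files and the optional-input files above comes down to which subsets of the three NMR types a loader or model has to handle. As a small illustration (the constant and function names are hypothetical, not taken from the repo), the possible input combinations can be enumerated like this:

```python
from itertools import combinations

NMR_TYPES = ("HSQC", "C-NMR", "H-NMR")


def nmr_combinations(types=NMR_TYPES):
    """Return every non-empty subset of the given NMR types, smallest first."""
    return [
        combo
        for r in range(1, len(types) + 1)
        for combo in combinations(types, r)
    ]
```

With three NMR types there are seven non-empty combinations. Three of them involve only 1D spectra (only C-NMR, only H-NMR, or both), which matches the cases listed for oneD_dataset.py.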
To train a model that can take any input NMR combination to predict entropy-based-fingerprint:
python train_ranker_transformer.py transformer_2d1d --foldername flexible_models_jittering_size_1 --expname r0_r6 --optional_inputs true --combine_oneD_only_dataset true --random_seed 1 --FP_choice Hash_Entropy_FP_R_6 --out_dim 16384 --jittering 1
- If you want to make "Molecular Weight" also an optional input, add an extra flag
--optional_MW true
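If you launch training from a Python script rather than typing the command, the argument list can be assembled programmatically. The sketch below only reuses the flags and values shown above; the function name is hypothetical, and varying `random_seed` assumes the script accepts other integer seeds.

```python
def build_flexible_train_command(random_seed=1, optional_mw=False):
    """Assemble the argument list for the flexible-model training command."""
    cmd = [
        "python", "train_ranker_transformer.py", "transformer_2d1d",
        "--foldername", "flexible_models_jittering_size_1",
        "--expname", "r0_r6",
        "--optional_inputs", "true",
        "--combine_oneD_only_dataset", "true",
        "--random_seed", str(random_seed),
        "--FP_choice", "Hash_Entropy_FP_R_6",
        "--out_dim", "16384",
        "--jittering", "1",
    ]
    if optional_mw:
        # Extra flag that also makes "Molecular Weight" an optional input.
        cmd += ["--optional_MW", "true"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains train_ranker_transformer.py.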
If you want to train a non-flexible model, i.e., one that accepts only a fixed type of NMR(s), you can refer to this GitHub page to see how I schedule the training in different settings.
For example, to train a model that takes only HSQC:
python train_ranker_transformer.py transformer_2d1d --foldername train_on_all_data_possible_with_jittering --jittering 1 --random_seed 1 --expname only_hsqc --use_oneD_NMR_no_solvent false --FP_choice Hash_Entropy_FP_R_6 --out_dim 16384