SPECTRE: A Spectral Transformer for Molecule Identification

Installing packages:

pip install -r requirements.txt

Alternatively, we recommend using our Docker image directly:

gitlab-registry.nrp-nautilus.io/w6xu/guru-docker-image:3699f97c

Dataset Structure

Internal developer dataset (accessible only to project developers), available here.

Public dataset, with HSQC spectra removed due to legal concerns, is available here.

Unfortunately, the dataset names and file paths are hard-coded in our repo. You are welcome to reorganize the paths yourself. Otherwise, we suggest the following setup:

Unzip the dataset archive into /workspace:
unzip -q DatasetWithoutHSQC.zip -d /workspace/
Git clone our Spectre repo at /root/gurusmart/MorganFP_prediction/reproduce_previous_works

Here, OneD_Only_Dataset contains compounds for which only 1D NMR spectra are available. SMILES_dataset contains compounds for which 2D HSQC NMR spectra are always available and 1D NMR spectra are sometimes available.

HYUN_FP refers to the fingerprint proposed by Hyunwoo Kim and used in the previous SOTA method, DeepSAT. MW/index.pkl, Chemical/index.pkl, and SMILES/index.pkl record the molecular weight, chemical name, and SMILES string of each compound.
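As a rough illustration of how the three index files could be read together, here is a minimal sketch. It assumes each index.pkl is a pickled dictionary keyed by compound index, which is not confirmed above; the helper load_index and the toy folder are hypothetical and exist only for this example.

```python
import pickle
import tempfile
from pathlib import Path


def load_index(dataset_root: Path, field: str) -> dict:
    """Load <dataset_root>/<field>/index.pkl, where field is MW, Chemical, or SMILES."""
    with open(dataset_root / field / "index.pkl", "rb") as f:
        return pickle.load(f)


# Toy stand-in for a real dataset folder. The dict-per-field layout is an
# assumption made for illustration, not the documented file format.
toy_root = Path(tempfile.mkdtemp())
toy_fields = {
    "SMILES": {0: "CCO", 1: "c1ccccc1"},
    "MW": {0: 46.07, 1: 78.11},
    "Chemical": {0: "ethanol", 1: "benzene"},
}
for field, mapping in toy_fields.items():
    (toy_root / field).mkdir()
    with open(toy_root / field / "index.pkl", "wb") as f:
        pickle.dump(mapping, f)

# Look up all three pieces of metadata for compound 1.
smiles = load_index(toy_root, "SMILES")
weights = load_index(toy_root, "MW")
names = load_index(toy_root, "Chemical")
record = {"name": names[1], "smiles": smiles[1], "mw": weights[1]}
```

With a real dataset, toy_root would be replaced by the dataset folder (for example the SMILES_dataset directory), provided the files actually follow this layout.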

Data Processing before training the models

(No longer required) Execute these notebooks to pre-build pickle files if you want fast access to the molecules that have all three types of NMR (HSQC, C-NMR, H-NMR):

{repo_path}/Spectre/notebooks/dataset_building/find_all_info_indices.ipynb

{repo_path}/Spectre/notebooks/dataset_building/generate_all_morganFP.ipynb

Analyze the entropy of each fragment in the dataset (this step is not needed if you can download our internal version of the dataset):

bash {repo_path}/Spectre/notebook_and_scripts/SMILES_fragmenting/build_dataset_specific_FP/save_zip.sh
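The contents of save_zip.sh are not shown here, so the following is only a sketch of one plausible reading of "entropy of each fragment": treat the presence of a fragment in a molecule as a binary variable, compute its entropy over the dataset, and keep the most informative fragments. The function names and the toy fragment lists are hypothetical.

```python
import math
from collections import Counter


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no variable that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def rank_fragments_by_entropy(molecule_fragments, top_k):
    """Return the top_k fragments whose presence/absence is most informative."""
    n_molecules = len(molecule_fragments)
    # Count each fragment at most once per molecule.
    counts = Counter(frag for frags in molecule_fragments for frag in set(frags))
    scores = {frag: binary_entropy(c / n_molecules) for frag, c in counts.items()}
    # Highest entropy first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda frag: (-scores[frag], frag))
    return ranked[:top_k]


# Toy data: "C" occurs in every molecule (uninformative), "O" in half of them.
toy_dataset = [["C", "O"], ["C", "O", "N"], ["C"], ["C"]]
top_fragments = rank_fragments_by_entropy(toy_dataset, top_k=2)
```

A fragment that appears in every molecule or in none carries no information, which is why such fragments score zero under this reading.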

Build retrieval set of more than 500K molecules

Download the pkl file containing the molecules' SMILES strings, names, and molecular weights

Download the pickle file from Google Drive and save it as /root/gurusmart/MorganFP_prediction/inference_data/coconut_loutus_hyun_training/inference_metadata_latest_RDkit.pkl
Run this script twice, first with step=1 and then with step=2, to generate the fingerprints of the retrieval set (by default we use radii from 0 to 6 and build fingerprints of length 16384):
step=1 python {repo_path}/Spectre/notebook_and_scripts/dataset_building/build_infernce_set_db_specific.py
step=2 python {repo_path}/Spectre/notebook_and_scripts/dataset_building/build_infernce_set_db_specific.py
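To illustrate what a retrieval set of fingerprints is typically used for, here is a minimal sketch that ranks candidate molecules against a query fingerprint. Cosine similarity is an assumption chosen for illustration and is not confirmed as the measure used in this repo; the candidate names and bit positions are made up, and fingerprints are represented as sets of on-bit indices rather than full 16384-length vectors.

```python
import math

FP_LENGTH = 16384  # default fingerprint length stated above


def cosine_similarity(on_bits_a: set, on_bits_b: set) -> float:
    """Cosine similarity between two binary fingerprints given as sets of on-bit indices."""
    if not on_bits_a or not on_bits_b:
        return 0.0
    shared = len(on_bits_a & on_bits_b)
    return shared / math.sqrt(len(on_bits_a) * len(on_bits_b))


def rank_candidates(query_bits: set, retrieval_set: dict) -> list:
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_bits, bits), name)
              for name, bits in retrieval_set.items()]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical retrieval set: every index is below FP_LENGTH.
toy_retrieval_set = {
    "mol_a": {1, 5, 900},
    "mol_b": {1, 5, 16000},
    "mol_c": {7, 8},
}
ranking = rank_candidates({1, 5, 16000}, toy_retrieval_set)
```

For a retrieval set of more than 500K molecules, a vectorized implementation would be preferable; the set-based version here only shows the idea.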

A brief explanation of the files

datasets:
- hsqc_folder_dataset.py: A dataset file that loads a given combination of NMRs, such as only HSQC, HSQC and H-NMR, all three NMRs, etc.
- oneD_dataset.py: A dataset file that loads a given combination of NMRs when all of them are 1D NMRs, i.e., only H-NMR, only C-NMR, or both 1D NMRs.
- optional_2d_folder_dataset: A dataset file that can load every possible combination of NMRs; it is used to train models that take optional inputs.
models:
- ranked_transformer.py: A Transformer model that can take a specific NMR combination
- ranked_resnet.py: A ResNet model that can take a specific NMR combination
- optional_input_ranked_transformer.py: A Transformer model that can take any NMR combination
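For reference, the NMR combinations mentioned in the list above can be enumerated with a short sketch. The constant NMR_TYPES and the function name are hypothetical and are not taken from the repo's code.

```python
from itertools import combinations

# The three NMR types named in this README.
NMR_TYPES = ("HSQC", "C-NMR", "H-NMR")


def all_nmr_combinations(nmr_types=NMR_TYPES):
    """Return every non-empty subset of the given NMR types."""
    return [
        combo
        for size in range(1, len(nmr_types) + 1)
        for combo in combinations(nmr_types, size)
    ]


combos = all_nmr_combinations()
# Combinations without HSQC correspond to the cases handled by the 1D-only dataset.
one_d_only = [combo for combo in combos if "HSQC" not in combo]
```

Three input types give seven non-empty combinations, three of which contain only 1D NMRs.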

Running

To train a model that can take any input NMR combination to predict the entropy-based fingerprint:
python train_ranker_transformer.py transformer_2d1d --foldername flexible_models_jittering_size_1 --expname r0_r6 --optional_inputs true --combine_oneD_only_dataset true --random_seed 1 --FP_choice Hash_Entropy_FP_R_6 --out_dim 16384 --jittering 1

To make molecular weight an optional input as well, add the extra flag --optional_MW true

If you want to train a non-flexible model, i.e., one that accepts only a fixed type of NMR(s), you can refer to this GitHub page to see how I schedule the training in different settings.

For example, to train a model that takes only HSQC:
python train_ranker_transformer.py transformer_2d1d --foldername train_on_all_data_possible_with_jittering --random_seed 1 --expname only_hsqc --use_oneD_NMR_no_solvent false --FP_choice Hash_Entropy_FP_R_6 --out_dim 16384 --jittering 1

