MinchaoFang/IMUSE

IMUSE

IMUSE (Integrating Machine-learning and Ultra-high-throughput Screening for Enzyme spaces exploration) offers dual capabilities across the biomolecular landscape: navigating sequence space to capture the effects of mutations via SSM data, and exploring structural space to filter diverse and novel protein backbones.

Structure space model

Install packages

conda create -n ESM_MPNN python=3.10
conda activate ESM_MPNN
pip install torch==2.0.1+cu117 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-2.0.1+cu117.html
pip install transformers==4.33.3 rdkit==2024.03.3 biopython==1.83 pandas==2.3.3 atom3d==0.2.6 huggingface-hub==0.36.0 datasets scikit-learn scipy sentencepiece  ankh  aiohttp tqdm wandb
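
After installing, it can help to confirm that the pinned versions were actually resolved. The following is a minimal sketch, not part of the repository: `check_pins` and `PINS` are hypothetical names, and the dictionary only lists a few of the pins from the command above. Packages whose installed version string may be normalized differently from the pin (for example `rdkit` or the CUDA build of `torch`) are left out on purpose, since this sketch compares version strings exactly.

```python
from importlib import metadata

# A subset of the pins from the pip command above (hypothetical helper data).
PINS = {
    "transformers": "4.33.3",
    "biopython": "1.83",
    "pandas": "2.3.3",
    "atom3d": "0.2.6",
    "huggingface-hub": "0.36.0",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    The comparison is an exact string match on the version.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated `ESM_MPNN` environment; no output means every listed pin matched.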

Training

python Structure_space/train.py

Inference Example

python Structure_space/inference.py

Sequence space model

Data preparation is documented in data_prepare.ipynb.

The original SaProt code is available at https://github.com/westlake-repl/SaProt.

Install packages

conda create -n SaProt python=3.10
conda activate SaProt
bash SaProt/environment.sh

Troubleshooting

If you encounter the following error:

ModuleNotFoundError: No module named 'pkg_resources'

try reinstalling a compatible version of setuptools:

pip install setuptools==65.6.3 --force-reinstall
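
To check whether the reinstall is needed, or whether it worked, one option is to ask the import system if `pkg_resources` can be found. This is a small sketch using only the standard library; `has_module` is a hypothetical helper name, not something provided by SaProt or IMUSE.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("pkg_resources"):
        print("pkg_resources is importable")
    else:
        print("pkg_resources is missing; reinstall setuptools as described above")
```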

Once the environment is set up, SaProt can be trained and used for inference with the commands below.

Training

cd SaProt
# export WANDB_MODE=offline
python scripts/training.py  --config config/ZN_enzyme/saprot_zn_enzyme.yaml

Inference Example

cd SaProt

python -m model.saprot.inference_csv \
  --input_csv path/to/input.csv \
  --output_csv path/to/output.csv \
  --checkpoint path/to/checkpoint.pt \
  --config_path path/to/config \
  --start_row 0 \
  --end_row 1000 \
  --batch_size 32 \
  --device cuda

# Example command.
# --checkpoint: the checkpoint produced by training.
# --config_path: the SaProt_650M_AF2 directory, downloaded from https://huggingface.co/westlake-repl/SaProt_650M_AF2
python -m model.saprot.inference_csv \
  --input_csv /storage/caolab/fangmc/code/IMUSE/dataset/raw_SSM_for_inference.csv \
  --output_csv /storage/caolab/fangmc/code/IMUSE/dataset/raw_SSM_for_inference_out.csv \
  --checkpoint /storage/caolab/fangmc/code/IMUSE/dataset/SaProt_650M_AF2_epoch0_1.pt \
  --config_path /storage/caolab/fangmc/code/IMUSE/SaProt/SaProt_650M_AF2 \
  --start_row 0 \
  --end_row 5 \
  --batch_size 1 \
  --device cuda
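
For large CSV files, the `--start_row` and `--end_row` flags make it possible to run inference in chunks. The sketch below only generates the command lines; it does not run them. It assumes that `--end_row` is exclusive, which this README does not state, so check the script before relying on it (if it is inclusive, adjacent chunks would share one row). `row_chunks`, `build_commands`, and the per-chunk output naming are hypothetical and not part of the repository.

```python
import shlex


def row_chunks(total_rows, chunk_size):
    """Yield (start_row, end_row) pairs covering rows 0..total_rows in order.

    Assumes end_row is exclusive (an assumption, not confirmed by the README).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


def build_commands(input_csv, output_prefix, checkpoint, config_path,
                   total_rows, chunk_size, batch_size=32, device="cuda"):
    """Return one shell-quoted inference command per row chunk."""
    commands = []
    for start, end in row_chunks(total_rows, chunk_size):
        args = [
            "python", "-m", "model.saprot.inference_csv",
            "--input_csv", input_csv,
            # Hypothetical naming scheme: one output file per chunk.
            "--output_csv", f"{output_prefix}_{start}_{end}.csv",
            "--checkpoint", checkpoint,
            "--config_path", config_path,
            "--start_row", str(start),
            "--end_row", str(end),
            "--batch_size", str(batch_size),
            "--device", device,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in build_commands("path/to/input.csv", "path/to/output",
                              "path/to/checkpoint.pt", "path/to/config",
                              total_rows=2500, chunk_size=1000):
        print(cmd)
```

Run the printed commands from inside the `SaProt` directory, as in the example above, and concatenate the per-chunk outputs afterwards.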

Benchmark for other models

Benchmarks for other models are in benchmark_for_other_models.ipynb.

Citation

If you find this repository useful, please cite our paper:

@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
