IMUSE (Integrating Machine-learning and Ultra-high-throughput Screening for Enzyme spaces exploration) offers two capabilities: navigating sequence space to capture the effects of mutations from site-saturation mutagenesis (SSM) data, and exploring structural space to filter diverse and novel protein backbones.
conda create -n ESM_MPNN python=3.10
conda activate ESM_MPNN
pip install torch==2.0.1+cu117 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-2.0.1+cu117.html
pip install transformers==4.33.3 rdkit==2024.03.3 biopython==1.83 pandas==2.3.3 atom3d==0.2.6 huggingface-hub==0.36.0 datasets scikit-learn scipy sentencepiece ankh aiohttp tqdm wandb
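After installing, it can help to confirm that the pinned versions actually landed in the ESM_MPNN environment. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and the pin table is copied from the pip commands above (rdkit is left out because its installed version string may be normalized differently from the pin).

```python
from importlib import metadata

# Pins copied from the pip commands above.
# Assumption: torch reports its local version tag (+cu117) when installed from the cu117 index.
PINS = {
    "torch": "2.0.1+cu117",
    "transformers": "4.33.3",
    "biopython": "1.83",
    "pandas": "2.3.3",
    "atom3d": "0.2.6",
    "huggingface-hub": "0.36.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every pinned package matches.
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

Run it inside the activated environment. Any line it prints names a package that is missing or installed at a different version.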
python Structure_space/train.py
python Structure_space/inference.py
Data preparation steps are in data_prepare.ipynb.
The original SaProt code is available at https://github.com/westlake-repl/SaProt.
conda create -n SaProt python=3.10
conda activate SaProt
bash SaProt/environment.sh
If you encounter the following error:
ModuleNotFoundError: No module named 'pkg_resources'
try reinstalling a compatible version of setuptools:
pip install setuptools==65.6.3 --force-reinstall
To train SaProt, use the following command:
cd SaProt
# export WANDB_MODE=offline
python scripts/training.py --config config/ZN_enzyme/saprot_zn_enzyme.yaml
To run inference using SaProt, use the following command:
cd SaProt
python -m model.saprot.inference_csv \
--input_csv path/to/input.csv \
--output_csv path/to/output.csv \
--checkpoint path/to/checkpoint.pt \
--config_path path/to/config \
--start_row 0 \
--end_row 1000 \
--batch_size 32 \
--device cuda
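The --start_row and --end_row flags above suggest that a large CSV can be processed in slices, for example one slice per job. The sketch below only builds the command lines and does not run them. It rests on assumptions not confirmed by the repository: that --end_row is exclusive, that each slice may write to its own output file, and the helper names `row_ranges` and `inference_commands` and the output naming scheme are hypothetical.

```python
import shlex


def row_ranges(total_rows, chunk_size):
    """Split [0, total_rows) into consecutive (start, end) pairs, end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_rows))
            for start in range(0, total_rows, chunk_size)]


def inference_commands(input_csv, output_prefix, checkpoint, config_path,
                       total_rows, chunk_size, batch_size=32, device="cuda"):
    """Build one inference_csv command line per row range (commands are not executed)."""
    commands = []
    for start, end in row_ranges(total_rows, chunk_size):
        args = [
            "python", "-m", "model.saprot.inference_csv",
            "--input_csv", input_csv,
            # Hypothetical naming scheme: one output file per slice.
            "--output_csv", f"{output_prefix}_{start}_{end}.csv",
            "--checkpoint", checkpoint,
            "--config_path", config_path,
            "--start_row", str(start),
            "--end_row", str(end),
            "--batch_size", str(batch_size),
            "--device", device,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in inference_commands("path/to/input.csv", "path/to/output",
                                  "path/to/checkpoint.pt", "path/to/config",
                                  total_rows=2500, chunk_size=1000):
        print(cmd)
```

If --end_row turns out to be inclusive in inference_csv, adjust `row_ranges` so that adjacent slices do not overlap.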
# Example command
# --checkpoint: the checkpoint produced by training
# --config_path: the SaProt_650M_AF2 directory downloaded from https://huggingface.co/westlake-repl/SaProt_650M_AF2
python -m model.saprot.inference_csv \
--input_csv /storage/caolab/fangmc/code/IMUSE/dataset/raw_SSM_for_inference.csv \
--output_csv /storage/caolab/fangmc/code/IMUSE/dataset/raw_SSM_for_inference_out.csv \
--checkpoint /storage/caolab/fangmc/code/IMUSE/dataset/SaProt_650M_AF2_epoch0_1.pt \
--config_path /storage/caolab/fangmc/code/IMUSE/SaProt/SaProt_650M_AF2 \
--start_row 0 \
--end_row 5 \
--batch_size 1 \
--device cuda
Benchmarks for other models can be checked in benchmark_for_other_models.ipynb.
If you find this repository useful, please cite our paper:
@article{su2023saprot,
title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}