gitter-lab/benchmarking-structure-based-models


Exploring zero-shot structure-based protein fitness prediction


This repository contains the code needed to replicate benchmarks of structure-based protein fitness prediction models in ProteinGym.

Exploring zero-shot structure-based protein fitness prediction.
Arnav Sharma, Anthony Gitter.
Generative and Experimental Perspectives for Biomolecular Design Workshop at the 13th International Conference on Learning Representations. 2025.


Setup

Conda environment setup

Set up the conda environment with:

  conda env create -f environment.yml

SSEmb setup

  • Download the test.tar.gz file from the Zenodo dataset repository and place it in the assets directory of this repo.
  • Run scripts/setup_ssemb.sh to set up the data and code.
  • Follow the instructions on the SSEmb GitHub to run ProteinGym assays.
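The first two steps can be sketched as a small shell helper. This is a minimal sketch, not part of the repository: `stage_archive` is a hypothetical name, the download location is an assumption, and invoking the setup script with `sh` from the repository root without arguments is also an assumption.

```shell
# Hypothetical helper (not part of this repo): copy a downloaded Zenodo
# archive into the assets directory, creating the directory if needed.
stage_archive() {
    src="$1"                     # where the archive was saved
    assets_dir="${2:-assets}"    # assets directory of this repo (default: ./assets)
    [ -f "$src" ] || { echo "missing archive: $src" >&2; return 1; }
    mkdir -p "$assets_dir" && cp "$src" "$assets_dir/"
}

# Example, run from the repository root (the download path is an assumption):
#   stage_archive "$HOME/Downloads/test.tar.gz" && sh scripts/setup_ssemb.sh
```

The same helper applies to experimental_struct_artifacts.tar.gz in the ProteinGym setup.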

ProteinGym setup for running ESM-IF1 with experimental structures

  • Download the experimental_struct_artifacts.tar.gz file from the Zenodo dataset repository and place it in the assets directory of this repo.
  • Run scripts/setup_proteinGym.sh to set up the data and code.
  • Follow the instructions on the ProteinGym GitHub to set up scripts/zero_shot_config.sh within ProteinGym.
  • cd into the cloned ProteinGym/scripts/scoring_DMS_zero_shot directory and run sh score_multichain_structs_esm.sh to generate ESM-IF1 scores for the experimental structures.
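Before launching the scoring run, it can help to confirm that the files named in the steps above are in place. The following is a sketch under stated assumptions: `require_files` is a hypothetical helper, and the example assumes ProteinGym was cloned into ./ProteinGym relative to the current directory.

```shell
# Hypothetical pre-flight check: report every expected file that is missing
# and return non-zero if any is absent.
require_files() {
    missing=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, assuming ProteinGym was cloned into ./ProteinGym:
#   require_files ProteinGym/scripts/zero_shot_config.sh \
#       ProteinGym/scripts/scoring_DMS_zero_shot/score_multichain_structs_esm.sh \
#     && (cd ProteinGym/scripts/scoring_DMS_zero_shot && sh score_multichain_structs_esm.sh)
```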

Accessing Data:

ProteinGym Data:

Get zero-shot scores:

wget https://marks.hms.harvard.edu/proteingym/zero_shot_substitutions_scores.zip

Extract the scores to a location of your choosing, then point each script at that location.
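One way to unpack the downloaded archive is sketched below. `extract_scores` and the destination directory are hypothetical, and how each script receives the location (argument, variable, or config edit) is not specified here, so check the individual scripts.

```shell
# Hypothetical helper: unpack the ProteinGym zero-shot score archive into a
# directory of your choosing and print that directory's absolute path.
extract_scores() {
    archive="$1"    # path to zero_shot_substitutions_scores.zip
    dest="$2"       # any directory you choose
    mkdir -p "$dest" || return 1
    unzip -q -o "$archive" -d "$dest" || return 1
    (cd "$dest" && pwd)
}

# Example, after the wget command above has finished:
#   SCORES_DIR=$(extract_scores zero_shot_substitutions_scores.zip "$HOME/proteingym_scores")
#   echo "point the scripts at: $SCORES_DIR"
```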

Generated Data:

Data to run experiments can be found in our Zenodo dataset repository.

  • experimental_struct_artifacts.tar.gz contains the experimentally determined structures for ProteinGym assays used in our analysis along with the reference file needed to generate ESM-IF1 predictions for these structures in ProteinGym. This compressed file contains:

  • results.tar.gz contains the prediction results obtained by running SSEmb on the 216 ProteinGym assays considered in this study. The archive contains:

    • df_total_proteingym_1.csv, which has all the SSEmb predictions for the 216 ProteinGym assays.
    • experimental_struct_scores.csv, which has all the ESM-IF1 (ESM inverse folding) predictions for the 61 assays with experimental structures.
  • test.tar.gz contains all the structures from ProteinGym as well as MSAs generated using MMseqs2. To use this directory:

    • Set up SSEmb as directed in its repository.
    • Download this file and extract it in the data folder.
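Extracting any of these archives into a target folder can be sketched as follows. `extract_into` is a hypothetical helper, and the destination path in the example is a placeholder for the data folder of your SSEmb checkout.

```shell
# Hypothetical helper: extract a .tar.gz archive into a destination folder,
# creating the folder first if it does not exist.
extract_into() {
    archive="$1"    # e.g. test.tar.gz or results.tar.gz
    dest="$2"       # folder to extract into
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example (the destination is a placeholder, adjust it to your setup):
#   extract_into test.tar.gz path/to/SSEmb/data
# To list an archive's contents without extracting it:
#   tar -tzf results.tar.gz
```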
