
PeptideMTR

This work was developed as a collaboration between Novo Nordisk and the Wilke lab at The University of Texas at Austin.

Authors

  • Aaron L. Feller [1,2]* ([email protected])
  • Maxim Secor [1]
  • Sebastian Swanson [1]
  • Claus O. Wilke [2]
  • Kristine Deibler [1]

[1] Molecular AI, Novo Nordisk; [2] Integrative Biology, The University of Texas at Austin


Table of Contents

  • Introduction
  • Getting Started
  • Usage
  • Models
  • Tokenizer
  • Datasets
  • Contributing
  • License

Introduction

PeptideMTR is a transformer-based representation-learning suite for therapeutic peptides. The project investigates how explicit physicochemical information (99 RDKit descriptors) used during training can enhance the predictive power of peptide models.

The framework benchmarks three distinct architectural approaches:

  1. MLM (Masked Language Modeling): Purely sequence-based learning via amino acid tokens.
  2. MTR-only: Regression models trained using a curated set of 99 RDKit physicochemical descriptors.
  3. MLM-MTR (Hybrid): A dual-objective architecture that leverages both latent sequence patterns and explicit chemical descriptors during the training phase.
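The hybrid objective above can be sketched as a weighted sum of a masked-token cross-entropy term and a descriptor-regression term. This is a minimal illustration of that idea, not the authors' implementation; the function names and the weighting factor `alpha` are assumptions for illustration only.

```python
import math

def mlm_loss(token_probs, target_ids):
    """Cross-entropy over the masked positions only.

    token_probs: per-masked-position probability distributions over the vocabulary.
    target_ids: the true token id at each masked position.
    """
    return -sum(math.log(p[t]) for p, t in zip(token_probs, target_ids)) / len(target_ids)

def mtr_loss(pred_descriptors, true_descriptors):
    """Mean-squared error over the physicochemical descriptor targets."""
    return sum((p - t) ** 2 for p, t in zip(pred_descriptors, true_descriptors)) / len(true_descriptors)

def hybrid_loss(token_probs, target_ids, pred_desc, true_desc, alpha=0.5):
    """Dual objective: blend sequence (MLM) and descriptor (MTR) losses.

    alpha is a hypothetical mixing weight; the actual scheme may differ.
    """
    return alpha * mlm_loss(token_probs, target_ids) + (1 - alpha) * mtr_loss(pred_desc, true_desc)
```

In a split-head architecture, a shared transformer encoder would feed both a token-prediction head (scored by `mlm_loss`) and a regression head (scored by `mtr_loss`), with gradients from both terms updating the shared weights.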

Getting Started

Installation

This repository uses pyproject.toml for dependency management. We recommend using uv for an extremely fast and reproducible setup.

  1. Clone the repository:

    git clone https://github.com/aaronfeller/PeptideMTR.git
    cd PeptideMTR
    
  2. Install dependencies and create a virtual environment: Using uv, you can sync the entire environment in seconds:

    uv sync
    
  3. Activate the environment:

    source .venv/bin/activate  # On macOS/Linux
    .venv\Scripts\activate     # On Windows
    

Alternatively, you can install the packages using standard pip:

pip install .

Usage

PeptideMTR models are designed for ease of use. Regardless of training objective, all finalized models, including the MTR variants, accept a SMILES string as the primary input for inference.

Models

All 9 model variants associated with the forthcoming paper are hosted on Hugging Face: huggingface.co/aaronfeller.

Model Variant       Strategy                  Training Features
PeptideMTR-MLM      Sequence pre-training     Masked SMILES tokens
PeptideMTR-MTR      Multi-target regression   99 RDKit descriptors
PeptideMTR-Hybrid   Split-head architecture   Masked SMILES tokens & 99 RDKit descriptors

Tokenizer

The project utilizes a custom tokenizer optimized for the peptide chemical space. This ensures robust handling of both standard and non-canonical amino acids, facilitating the mapping of SMILES strings to the model's latent space.
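The general idea of mapping a SMILES string into chemically meaningful tokens can be illustrated with a small regex-based splitter. This is a generic sketch, not the repository's custom tokenizer: the regex and function name are assumptions, and the real vocabulary is tuned to the peptide chemical space.

```python
import re

# Illustrative SMILES token pattern: bracket atoms ([C@@H], [nH], ...),
# two-letter elements, single-letter organic-subset atoms, bonds, branches,
# ring-closure digits, and charges/stereo marks.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|=|#|\(|\)|%[0-9]{2}|[0-9]|[+\-/\\@.])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; fail loudly on unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

# Glycine as a SMILES string:
print(tokenize("NCC(=O)O"))  # ['N', 'C', 'C', '(', '=', 'O', ')', 'O']
```

Matching bracket atoms as single tokens matters for peptides, since non-canonical amino acids and stereocenters (e.g. `[C@@H]`) would otherwise be shredded into meaningless character fragments.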

Datasets

The training and validation data used to develop these models, including the 99 pre-computed RDKit descriptors and their corresponding biochemical targets, are available at PeptideMTR_pretraining_data.

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.

License

This project is licensed under the MIT License - see the LICENSE file for details.
