This repository provides a simplified, educational implementation of AlphaFold, the protein structure prediction model. It is meant for learning the core concepts of protein folding and for experimenting with the associated machine learning techniques. It includes a toy protein dataset and the building blocks for a simplified model.
- Virtual Environment: Create and activate a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate
- Install Dependencies: Install required packages (NumPy, SciPy, Matplotlib, tqdm, and PyTorch):
pip install -r requirements.txt
python minifold_dataset.py
This script generates a toy protein dataset with a reduced amino acid alphabet and a simplified energy function. The data is split into train, validation, and test sets and saved in the protein_dataset directory. A usage_example.py file is also generated within this directory.
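To make "reduced alphabet" and "simplified energy function" concrete, here is a minimal NumPy sketch. It does not reproduce minifold_dataset.py: the two-letter H/P alphabet, the random-walk 2D chain, the contact-count energy, the 80/10/10 split, and every function name below are assumptions chosen for illustration.

```python
import numpy as np

ALPHABET = "HP"  # hypothetical reduced alphabet: hydrophobic / polar


def random_protein(length, rng):
    """Return a random sequence and a 2D chain with unit-length bonds."""
    sequence = "".join(rng.choice(list(ALPHABET), size=length))
    angles = rng.uniform(0.0, 2.0 * np.pi, size=length - 1)
    steps = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    coords = np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])
    return sequence, coords


def toy_energy(sequence, coords, cutoff=1.5):
    """Assumed energy: -1 for every non-bonded H-H pair closer than `cutoff`."""
    is_h = np.array([c == "H" for c in sequence])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(sequence), k=2)  # skip self and bonded neighbours
    contacts = (dist[i, j] < cutoff) & is_h[i] & is_h[j]
    return -float(contacts.sum())


def split_dataset(items, rng, fractions=(0.8, 0.1, 0.1)):
    """Shuffle and split into train / validation / test lists."""
    order = rng.permutation(len(items))
    n_train = int(fractions[0] * len(items))
    n_val = int(fractions[1] * len(items))
    train = [items[k] for k in order[:n_train]]
    val = [items[k] for k in order[n_train:n_train + n_val]]
    test = [items[k] for k in order[n_train + n_val:]]
    return train, val, test


rng = np.random.default_rng(0)
proteins = []
for _ in range(100):
    seq, xy = random_protein(int(rng.integers(8, 17)), rng)
    proteins.append({"sequence": seq, "coordinates": xy, "energy": toy_energy(seq, xy)})
train, val, test = split_dataset(proteins, rng)
```

The point of the sketch is only that each sample pairs a sequence with coordinates and a scalar energy; the script in this repository may define all three differently.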
python preprocess_dataset.py
This script preprocesses the generated dataset, including padding, relative position encoding, normalization, and caching. The preprocessed data is saved as .pt files in the preprocessed_protein_dataset directory. Dataset statistics are also saved.
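The three preprocessing steps named above can be sketched as follows. This is an illustration in plain NumPy, not the code in preprocess_dataset.py: the function names, the use of index 0 for padding, the clipping window of 4, and the z-score normalization are all assumptions.

```python
import numpy as np


def pad_sequences(index_lists, pad_value=0):
    """Right-pad variable-length index lists to a common length."""
    max_len = max(len(s) for s in index_lists)
    padded = np.full((len(index_lists), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(index_lists):
        padded[row, :len(seq)] = seq
    lengths = np.array([len(s) for s in index_lists])
    return padded, lengths


def relative_positions(max_len, clip=4):
    """Clipped offsets j - i, shifted into the range [0, 2 * clip]."""
    idx = np.arange(max_len)
    offsets = np.clip(idx[None, :] - idx[:, None], -clip, clip)
    return offsets + clip


def normalize(values):
    """Standardize to zero mean and unit variance; also return the statistics."""
    mean, std = values.mean(), values.std()
    return (values - mean) / (std + 1e-8), {"mean": float(mean), "std": float(std)}


# Amino acid indices start at 1 here so that 0 can mean "padding" (assumption).
padded, lengths = pad_sequences([[1, 2, 1], [2, 1, 1, 2, 2]])
rel = relative_positions(padded.shape[1])
energies, stats = normalize(np.array([-3.0, -1.0, -2.0]))
```

Keeping the original lengths next to the padded arrays, and the normalization statistics next to the normalized values, is what lets later code mask out padding and map predictions back to the original scale.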
import torch
from torch.utils.data import DataLoader
from preprocessed_protein_dataset import PreprocessedProteinDataset
# Load datasets
train_dataset = PreprocessedProteinDataset.load('preprocessed_protein_dataset/train')
val_dataset = PreprocessedProteinDataset.load('preprocessed_protein_dataset/val')
test_dataset = PreprocessedProteinDataset.load('preprocessed_protein_dataset/test')
# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# Example: Iterate through the train loader ♻️
for batch in train_loader:
    sequence = batch['sequence']  # Padded sequence indices
    coords = batch['coordinates']  # 2D coordinates
    distance_matrix = batch['distance_matrix']  # Padded distance matrix
    position_encoding = batch['position_encoding']  # Padded position encoding
    energy = batch['energy']  # Protein energies
    length = batch['length']  # Original sequence lengths
    properties = batch['properties']  # Hydrophobic and charged ratios
    # Process the batch (e.g., feed to your model)
    print(f"Sequence shape: {sequence.shape}")
    print(f"Coordinates shape: {coords.shape}")
    # ... process other features ...
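Because the sequences and distance matrices in a batch are padded, the length entry is what tells real residues from padding. The sketch below shows one way to use it when comparing predicted and target distance matrices. It uses NumPy so it runs without the dataset; the same broadcasting works on PyTorch tensors. The helper names and the choice of a mean squared error are assumptions, not part of this repository.

```python
import numpy as np


def padding_mask(lengths, max_len):
    """Boolean mask of shape (batch, max_len); True marks real residues."""
    return np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]


def masked_distance_mse(pred, target, lengths):
    """Mean squared error over residue pairs where both residues are real."""
    mask = padding_mask(lengths, pred.shape[1])
    pair_mask = mask[:, :, None] & mask[:, None, :]
    squared_error = (pred - target) ** 2
    return float((squared_error * pair_mask).sum() / pair_mask.sum())


lengths = [3, 5]
target = np.zeros((2, 5, 5))
pred = np.ones((2, 5, 5))
pred[0, 3:, :] = 100.0  # garbage in the padded rows ...
pred[0, :, 3:] = 100.0  # ... and padded columns of the shorter protein
loss = masked_distance_mse(pred, target, lengths)
```

Without the mask, the values in the padded region of the shorter protein would dominate the loss.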
- mini_alphafold.py: The main model architecture (SimplifiedAlphaFold). Integrates the Pairformer and Diffusion modules to predict protein structures. Accepts sequences, distance matrices, positional encodings, and other properties as input, outputting 3D coordinates. Includes a training_step function.
- pairformer.py: Implements a simplified Pairformer block using Triangle Attention. Focuses on pairwise amino acid interactions, using distances between residues for more informed attention calculations.
- simplediffusion.py: A simplified diffusion model. Generates protein structures by denoising random Gaussian noise conditioned on sequence embeddings and timesteps.
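For readers new to diffusion models, the noising and denoising relationship that such a module builds on can be sketched in a few lines. This follows the standard DDPM-style formulation with a linear noise schedule and is written in NumPy for brevity. It is not taken from simplediffusion.py, which may use a different schedule or parameterization. The function names and schedule values are assumptions, and conditioning on sequence embeddings is left out.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return np.linspace(beta_start, beta_end, num_steps)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


def estimate_x0(x_t, predicted_noise, t, alpha_bar):
    """Invert the noising formula given an estimate of the noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])


rng = np.random.default_rng(0)
betas = linear_beta_schedule(100)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per timestep

x0 = rng.standard_normal((10, 3))  # stand-in for 10 residues in 3D
x_t, noise = add_noise(x0, t=50, alpha_bar=alpha_bar, rng=rng)
# A trained network would predict the noise; the true noise is used here as an oracle.
recovered = estimate_x0(x_t, noise, t=50, alpha_bar=alpha_bar)
```

In training, a network receives x_t, the timestep, and the conditioning signal, and learns to predict the noise. Generation then starts from pure Gaussian noise and applies that prediction step by step.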
This project is under active development! Dataset generation and preprocessing are complete. Model training and inference are next.
- train.py: (Planned) Training script for mini_alphafold.py. Handles training loops, loss calculation, optimization, and checkpointing.
- inference.py: (Planned) Inference script to predict 3D structures of new sequences.
- Model Refinement: (Future) Advanced loss functions, attention mechanisms, and other improvements.
- Dataset Expansion: (Future) Larger amino acid alphabet and/or more realistic energy functions.
- Evaluation Metrics: (Future) Implement metrics to assess model performance.
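As one candidate for the planned evaluation metrics, the sketch below computes the root-mean-square deviation (RMSD) between predicted and reference coordinates after optimal rigid alignment with the Kabsch algorithm. RMSD is a common choice for structure prediction, but whether this project will adopt it is an assumption, and the function name is hypothetical. The code works for both 2D and 3D coordinates.

```python
import numpy as np


def kabsch_rmsd(pred, target):
    """RMSD between two (N, D) coordinate sets after optimal rigid alignment."""
    p = pred - pred.mean(axis=0)      # remove translation
    q = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    correction = np.ones(p.shape[1])
    correction[-1] = np.sign(np.linalg.det(u @ vt))
    rotation = (u * correction) @ vt
    aligned = p @ rotation
    return float(np.sqrt(((aligned - q) ** 2).sum(axis=1).mean()))


rng = np.random.default_rng(0)
target = rng.standard_normal((12, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved = target @ rot_z + np.array([1.0, -2.0, 0.5])  # same shape, new pose
perturbed = target + 0.1 * rng.standard_normal(target.shape)

rmsd_same_shape = kabsch_rmsd(moved, target)      # close to zero
rmsd_perturbed = kabsch_rmsd(perturbed, target)   # small but positive
```

Aligning before measuring matters because a predicted structure is only defined up to rotation and translation; comparing raw coordinates would penalize a correct fold in a different pose.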
- minifold_dataset.py: Dataset generation
- preprocess_dataset.py: Dataset preprocessing
- mini_alphafold.py: Simplified AlphaFold model
- train.py: (Planned) Training script
- inference.py: (Planned) Inference script
- protein_dataset/: Raw dataset
- preprocessed_protein_dataset/: Preprocessed dataset
- pairformer.py: Pairformer module
- simplediffusion.py: Diffusion module
- requirements.txt: Project dependencies
Contributions and suggestions are welcome! This educational project is a simplified starting point for understanding protein folding.