This project uses machine learning to predict whether a gene from the bacterium *Deinococcus radiodurans* contributes to stress tolerance based on features derived purely from its DNA sequence. The goal is to create a computational tool that can rapidly screen genomes for candidate stress-response genes, potentially aiding in bioengineering and synthetic biology.
## Table of Contents

- Project Goal
- How It Works
- Features Engineered
- Machine Learning Pipeline
- How to Use This Project
- File Descriptions
- Example Prediction Workflow
- Future Work
## Project Goal

The primary objective is to build and train a machine learning model that can classify a given gene as either a "stress-response gene" or a "normal/housekeeping gene." Instead of relying on expensive and time-consuming laboratory experiments, this model leverages patterns in the DNA sequence itself to make predictions.
This serves as a proof-of-concept for a high-throughput screening tool to prioritize genes for further study in newly sequenced organisms, especially extremophiles.
## How It Works

The project follows a classic supervised learning approach:

- **Data Collection:** The complete annotated genome of *Deinococcus radiodurans* (`.gbff` format) is downloaded from the NCBI database.
- **Labeling:** Genes are assigned one of two labels (see the sketch after this list):
  - `Stress` (Positive Class): A list is created using a hybrid strategy of (a) manually curated, literature-verified stress genes and (b) programmatic searching for functional keywords like "DNA repair," "radiation resistance," "chaperone," etc.
  - `Control` (Negative Class): A list of housekeeping genes is created using a similar strategy, identifying genes for essential functions like "ribosomal protein" or "gyrase."
- **Feature Engineering:** Each gene's DNA sequence is converted into a set of meaningful numerical features that the model can understand.
- **Model Training:** An XGBoost classifier is trained on the labeled, feature-engineered dataset to learn the patterns that differentiate the two classes.
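The keyword-based half of the labeling step might look like the following minimal sketch. The keyword lists here are illustrative placeholders, not the notebook's curated lists, and `label_gene` is a hypothetical helper name:

```python
# Minimal sketch of keyword-based labeling from the GenBank file (requires Biopython).
from Bio import SeqIO

GBFF_PATH = "GCF_000012145.1_ASM1214v1_genomic.gbff"  # downloaded by the notebook

# Hypothetical keyword lists; the notebook combines keyword hits with a
# manually curated, literature-verified gene list.
STRESS_KEYWORDS = ["dna repair", "radiation resistance", "chaperone"]
CONTROL_KEYWORDS = ["ribosomal protein", "gyrase"]

def label_gene(product):
    """Return 'Stress', 'Control', or None from a CDS product annotation."""
    text = product.lower()
    if any(kw in text for kw in STRESS_KEYWORDS):
        return "Stress"
    if any(kw in text for kw in CONTROL_KEYWORDS):
        return "Control"
    return None  # unlabeled genes are left out of the training set

labeled_genes = []
for record in SeqIO.parse(GBFF_PATH, "genbank"):
    for feature in record.features:
        if feature.type != "CDS":
            continue
        product = feature.qualifiers.get("product", [""])[0]
        label = label_gene(product)
        if label is not None:
            # Extract the gene's DNA sequence (strand-aware).
            labeled_genes.append((str(feature.extract(record.seq)), label))
```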
## Features Engineered

The model uses a rich set of features to build its predictions:

- **Basic Sequence Properties:**
  - `Gene Length`: Total length in base pairs.
  - `GC Content`: Percentage of Guanine and Cytosine bases.
  - `GC3 Content`: GC content at the third position of codons.
  - `CG Dinucleotide Frequency`: The relative abundance of CG pairs.
- **Protein-Level Features:**
  - `Hydrophobicity (GRAVY)`: The Grand Average of Hydropathicity of the translated protein.
  - `Isoelectric Point`: The pH at which the translated protein has no net charge.
- **Pattern-Based Features:**
  - `Motif Frequencies`: Occurrence of known regulatory motifs.
  - `K-mer Frequencies`: A high-resolution sequence "fingerprint" based on the frequency of all possible 4-letter DNA substrings (e.g., 'AAGC', 'GATA').
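As an illustration, these features could be computed roughly as below. `basic_features` is a hypothetical helper name, `gc_fraction` assumes Biopython ≥ 1.80 (older releases expose `Bio.SeqUtils.GC` instead), and motif counting is omitted for brevity:

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction  # Biopython >= 1.80
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from sklearn.feature_extraction.text import CountVectorizer

def basic_features(dna):
    """Compute the per-gene numeric features listed above for one DNA string."""
    protein = str(Seq(dna).translate(to_stop=True))  # stop at first stop codon
    analysis = ProteinAnalysis(protein)
    return {
        "gene_length": len(dna),                                  # bp
        "gc_content": gc_fraction(dna) * 100,                     # % G+C overall
        "gc3_content": gc_fraction(dna[2::3]) * 100,              # % G+C at codon position 3
        "cg_dinucleotide_freq": dna.count("CG") / max(len(dna) - 1, 1),
        "gravy": analysis.gravy(),                                # mean hydropathicity
        "isoelectric_point": analysis.isoelectric_point(),
    }

# K-mer "fingerprint": counts of every overlapping 4-letter substring.
kmer_vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
sequences = ["ATGGCACGTGGCTTTGAAGATCGTTGGCACTAA"]  # toy example
kmer_matrix = kmer_vectorizer.fit_transform(sequences)
```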
## Machine Learning Pipeline

A robust pipeline ensures the model is trained correctly and its performance is reliable:

- **Preprocessing (`VarianceThreshold`):** Automatically removes uninformative features that are constant across all samples.
- **Feature Selection (`SelectKBest`):** Selects the top 100 most informative features using a statistical ANOVA F-test, reducing noise and complexity.
- **Handling Class Imbalance (`SMOTE`):** Balances the training data by creating synthetic examples of the rare `Stress` class, preventing the model from becoming biased toward the majority class.
- **Hyperparameter Tuning (`GridSearchCV`):** Systematically tests different model configurations (e.g., `learning_rate`, `max_depth`) to find the optimal settings.
- **Training (`XGBoost`):** The final model is an XGBoost classifier, a powerful gradient-boosting algorithm well suited to tabular data.
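A minimal sketch of how these pieces could be wired together is shown below, using synthetic stand-in data and an `imblearn` pipeline so that SMOTE touches only the training folds. The grid values are illustrative, and the notebook itself fits and saves each component separately:

```python
# Sketch of the training pipeline with synthetic stand-in data; the real
# notebook builds X from the engineered gene features and y from the labels.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Synthetic imbalanced data standing in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=260,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# imblearn's Pipeline applies SMOTE only when fitting, never when predicting.
pipeline = Pipeline([
    ("variance", VarianceThreshold()),           # drop constant features
    ("select", SelectKBest(f_classif, k=100)),   # keep top 100 by ANOVA F-test
    ("smote", SMOTE(random_state=42)),           # synthesize minority-class samples
    ("model", XGBClassifier(eval_metric="logloss", random_state=42)),
])

param_grid = {                                   # illustrative grid values
    "model__learning_rate": [0.05, 0.1, 0.3],
    "model__max_depth": [3, 5, 7],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```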
## How to Use This Project

### Prerequisites

- Python 3.8+
- Jupyter Notebook or JupyterLab

### Installation

1. Clone this repository to your local machine:

   ```bash
   git clone <your-repository-url>
   cd <repository-directory>
   ```

2. Install the required Python libraries using pip:

   ```bash
   pip install biopython scikit-learn pandas xgboost matplotlib imbalanced-learn joblib
   ```

### Running the Pipeline

1. Launch Jupyter Notebook or JupyterLab:

   ```bash
   jupyter notebook
   ```

2. Open the main notebook file (e.g., `gene_prediction_pipeline.ipynb`).
3. Execute the cells in order from top to bottom. The notebook is self-contained and will automatically:
   - Download the necessary genome data.
   - Perform all data processing and training steps.
   - Save the final model and all required pipeline components.
   - Demonstrate how to load the saved artifacts and make predictions on sample data.
## File Descriptions

```
.
├── gene_prediction_pipeline.ipynb          # Main Jupyter Notebook with all the code.
├── README.md                               # This README file.
├── GCF_000012145.1_ASM1214v1_genomic.gbff  # Genome data (downloaded by the notebook).
└── saved_artifacts/
    ├── stress_gene_model.joblib            # The final, trained XGBoost model.
    ├── feature_selector.joblib             # The fitted SelectKBest object.
    ├── variance_thresholder.joblib         # The fitted VarianceThreshold object.
    ├── kmer_vectorizer.joblib              # The fitted CountVectorizer for k-mers.
    └── feature_columns.joblib              # The list of feature names the model expects.
```
## Example Prediction Workflow

After running the main notebook once, you can use the saved artifacts to predict on a new DNA sequence:

- **Load Artifacts:** Load the model, selector, thresholder, vectorizer, and column list using `joblib`.
- **Process New Sequence:** Apply the exact same feature engineering steps to your new sequence.
- **Align Features:** Use `.reindex()` to ensure the new feature vector has the same columns, in the same order, as the training data.
- **Apply Preprocessors:** Transform the data using the loaded `variance_thresholder` and `feature_selector`.
- **Predict:** Use `loaded_model.predict()` to get the final classification.
An example of this entire workflow is provided in the final section of the main Jupyter Notebook.
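For orientation, here is a minimal sketch of those five steps. It assumes the artifact paths listed above, that the `Stress` class was encoded as 1, and reuses the hypothetical `basic_features` helper from the feature-engineering sketch:

```python
# Sketch of predicting on one new sequence with the saved artifacts.
import joblib
import pandas as pd

model = joblib.load("saved_artifacts/stress_gene_model.joblib")
selector = joblib.load("saved_artifacts/feature_selector.joblib")
thresholder = joblib.load("saved_artifacts/variance_thresholder.joblib")
vectorizer = joblib.load("saved_artifacts/kmer_vectorizer.joblib")
columns = joblib.load("saved_artifacts/feature_columns.joblib")

new_seq = "ATGGCACGTGGCTTTGAAGATCGTTGGCACTAA"  # toy gene sequence

# Process: rebuild the same features used in training.
basic = pd.DataFrame([basic_features(new_seq)])
kmers = vectorizer.transform([new_seq])
kmer_df = pd.DataFrame(kmers.toarray(),
                       columns=vectorizer.get_feature_names_out())
row = pd.concat([basic, kmer_df], axis=1)

# Align: match the training columns and order; absent k-mers become 0.
row = row.reindex(columns=columns, fill_value=0)

# Preprocess: apply the fitted transformers in the training order.
row = selector.transform(thresholder.transform(row))

# Predict (assuming 1 = Stress, 0 = Control).
print("Stress gene" if model.predict(row)[0] == 1 else "Control gene")
```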
## Future Work

- **Expand the Dataset:** Incorporate data from other extremophiles (e.g., tardigrades, thermophiles) to create a more general and robust model.
- **Explore More Features:** Engineer additional features, such as the codon adaptation index (CAI) or predicted protein secondary structure.
- **Try Different Models:** Experiment with other algorithms like LightGBM or deep learning models (e.g., convolutional neural networks) to compare performance.
- **Deploy as a Web App:** Package the model and prediction pipeline into a simple web application where users can paste a DNA sequence and get a prediction.