DhaneshKolu/algorithm-selection-engine
Algorithm Selection Engine using Meta-Learning

Overview

Choosing the right machine learning algorithm for a dataset is often done through trial and error.
This project explores a meta-learning approach to automatically recommend the most suitable algorithm based on dataset characteristics, instead of relying on manual experimentation.

The system learns from multiple datasets by observing which algorithm performs best under different data conditions and then generalizes this knowledge to unseen datasets.


Problem Statement

Given a classification dataset, can we predict which ML algorithm (Logistic Regression, Decision Tree, or KNN) will perform best without training all of them?

This project treats algorithm selection as a supervised learning problem, where:

  • Each dataset is a data point
  • Dataset properties are features
  • The best-performing algorithm is the label

Project Structure

```
algorithm-selection-engine/
│
├── src/
│   ├── data_generation.py     # Synthetic dataset generator
│   ├── meta_features.py       # Dataset-level feature extraction
│   ├── evaluate_models.py     # Fair algorithm evaluation
│   ├── meta_model.py          # Meta-learning model
│   └── utils.py
│
├── experiments/
│   └── run_experiment.py      # End-to-end pipeline
│
├── data/
├── README.md
├── requirements.txt
└── .gitignore
```

Approach

1. Synthetic Dataset Generation

Since algorithm behavior must be observed across diverse datasets, synthetic classification datasets are generated with:

  • Binary and multiclass labels
  • Mild class imbalance
  • Controlled noise
  • Variable sample size and feature count

This allows systematic experimentation while keeping assumptions explicit.
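A generation step like this could be sketched with scikit-learn's `make_classification`. The parameter ranges, imbalance weights, and noise level below are illustrative assumptions, not the values used in `src/data_generation.py`:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)

def generate_dataset(seed):
    """Generate one synthetic classification dataset with randomized properties."""
    n_samples = int(rng.integers(200, 2000))   # variable sample size
    n_features = int(rng.integers(8, 21))      # variable feature count
    n_classes = int(rng.integers(2, 4))        # binary or multiclass
    # Mild class imbalance: one dominant class, the rest share the remainder
    weights = [0.6] + [0.4 / (n_classes - 1)] * (n_classes - 1)
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_features // 2,
        n_classes=n_classes,
        weights=weights,
        flip_y=0.05,          # controlled label noise
        random_state=seed,
    )
    return X, y
```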


2. Meta-Feature Extraction

Each dataset is summarized using dataset-level characteristics, including:

  • Number of samples
  • Number of features
  • Number of classes
  • Class imbalance ratio
  • Mean feature variance
  • Mean feature correlation
  • Label entropy

These features describe the nature of the dataset, not individual samples.
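A minimal extractor for these characteristics might look like the following sketch (function and field names are assumed; the project's actual implementation lives in `src/meta_features.py`):

```python
import numpy as np

def extract_meta_features(X, y):
    """Summarize a dataset (X, y) with dataset-level statistics."""
    n_samples, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    # Shannon entropy of the label distribution (bits)
    label_entropy = -np.sum(probs * np.log2(probs))
    # Mean absolute pairwise feature correlation (off-diagonal entries only)
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(n_features, dtype=bool)]
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": len(classes),
        "imbalance_ratio": float(counts.max() / counts.min()),
        "mean_feature_variance": float(X.var(axis=0).mean()),
        "mean_feature_correlation": float(np.abs(off_diag).mean()),
        "label_entropy": float(label_entropy),
    }
```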


3. Algorithm Evaluation

For each dataset, three candidate algorithms are evaluated:

  • Logistic Regression
  • Decision Tree
  • K-Nearest Neighbors

Each is scored with Stratified K-Fold cross-validation using the macro F1-score, which is appropriate for imbalanced and multiclass data.

The algorithm with the highest average F1 score is selected as the ground-truth best algorithm for that dataset.
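The evaluation loop could be sketched as follows. Candidate names and hyperparameters here are assumptions for illustration; the repository's version is in `src/evaluate_models.py`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

CANDIDATES = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

def best_algorithm(X, y, n_splits=5):
    """Return the candidate with the highest mean macro F1 under stratified CV."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
        for name, model in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores

# Example: find the ground-truth label for one synthetic dataset
X, y = make_classification(n_samples=300, n_classes=2, random_state=1)
winner, scores = best_algorithm(X, y)
```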


4. Meta-Model Training

A Decision Tree classifier is trained as a meta-model where:

  • Input → meta-features of a dataset
  • Output → best algorithm label

The decision tree provides interpretability and highlights how dataset properties influence algorithm choice.
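In sketch form, the meta-model is an ordinary scikit-learn decision tree fit on the meta-dataset. The meta-feature rows and labels below are invented for illustration, not results from the project:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["n_samples", "n_features", "n_classes", "label_entropy"]

# Illustrative meta-dataset: one row of meta-features per synthetic dataset
X_meta = np.array([
    [500,  10, 2, 0.97],
    [2000,  5, 3, 1.55],
    [300,  20, 2, 0.88],
    [1500,  8, 3, 1.50],
])
y_meta = np.array(["logreg", "tree", "knn", "tree"])  # observed best algorithms

meta_model = DecisionTreeClassifier(max_depth=3, random_state=0)
meta_model.fit(X_meta, y_meta)

# Interpretability: inspect the learned decision rules
print(export_text(meta_model, feature_names=FEATURES))

# Recommend an algorithm for an unseen dataset's meta-features
prediction = meta_model.predict([[400, 12, 2, 0.9]])
```

Because the meta-model sees only a handful of summary numbers per dataset, prediction is nearly free compared with training all candidate algorithms.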


5. Prediction on Unseen Datasets

Once trained, the meta-model predicts the best algorithm for new, unseen datasets, demonstrating generalization beyond the training data.


Results

  • The system successfully learns patterns linking dataset properties to algorithm performance.
  • Different datasets favor different algorithms depending on size, class structure, and feature relationships.
  • The end-to-end pipeline runs without manual intervention, from dataset generation to algorithm recommendation.

Achievements

  • Designed a complete meta-learning pipeline from scratch
  • Applied cross-validation and proper metrics for fair comparison
  • Handled imbalanced and multiclass classification
  • Built a professional, modular project structure
  • Gained practical understanding of:
    • Bias–variance trade-off
    • Algorithm suitability
    • Dataset-driven decision making

Limitations

  • The meta-dataset is relatively small, which can cause some algorithms to be underrepresented.
  • The meta-model may not predict rare optimal algorithms consistently.
  • Synthetic datasets may not capture all complexities of real-world data.
  • Hyperparameter tuning is intentionally minimal to keep focus on algorithm selection logic.

These limitations are expected and reflect real challenges in AutoML systems.


Future Improvements

  • Increase number and diversity of datasets
  • Add missing value simulation and imputation strategies
  • Include more algorithms (e.g., Random Forest, SVM)
  • Balance the meta-dataset across algorithm classes
  • Extend the system to regression problems

Key Learning

This project emphasizes that algorithm choice depends on data properties, not just model complexity.
By learning from datasets instead of samples, the system demonstrates a simplified but realistic view of how automated ML systems reason in practice.


Technologies Used

  • Python
  • NumPy
  • scikit-learn

Author

Built as part of hands-on learning after completing the Andrew Ng Machine Learning Specialization, with a focus on industry-relevant ML reasoning rather than tutorial-driven implementation.
