DhaneshKolu/algorithm-selection-engine
Algorithm Selection Engine using Meta-Learning

Overview

Choosing the right machine learning algorithm for a dataset is often done through trial and error.
This project explores a meta-learning approach to automatically recommend the most suitable algorithm based on dataset characteristics, instead of relying on manual experimentation.

The system learns from multiple datasets by observing which algorithm performs best under different data conditions and then generalizes this knowledge to unseen datasets.


Problem Statement

Given a classification dataset, can we predict which ML algorithm (Logistic Regression, Decision Tree, or KNN) will perform best without training all of them?

This project treats algorithm selection as a supervised learning problem, where:

  • Each dataset is a data point
  • Dataset properties are features
  • The best-performing algorithm is the label

Project Structure

```
algorithm-selection-engine/
│
├── src/
│   ├── data_generation.py     # Synthetic dataset generator
│   ├── meta_features.py       # Dataset-level feature extraction
│   ├── evaluate_models.py     # Fair algorithm evaluation
│   ├── meta_model.py          # Meta-learning model
│   └── utils.py
│
├── experiments/
│   └── run_experiment.py      # End-to-end pipeline
│
├── data/
├── README.md
├── requirements.txt
└── .gitignore
```

Approach

1. Synthetic Dataset Generation

Since algorithm behavior must be observed across diverse datasets, synthetic classification datasets are generated with:

  • Binary and multiclass labels
  • Mild class imbalance
  • Controlled noise
  • Variable sample size and feature count

This allows systematic experimentation while keeping assumptions explicit.
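A generation step like this could be sketched with scikit-learn's `make_classification`. The parameter ranges, imbalance weights, and noise level below are illustrative assumptions, not the values used in `src/data_generation.py`:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)

def generate_dataset(seed):
    """Generate one synthetic classification dataset with randomized properties."""
    n_samples = int(rng.integers(200, 2000))   # variable sample size
    n_features = int(rng.integers(8, 21))      # variable feature count
    n_classes = int(rng.integers(2, 4))        # binary or multiclass
    # Mild class imbalance: one dominant class, the rest share the remainder
    weights = [0.6] + [0.4 / (n_classes - 1)] * (n_classes - 1)
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_features // 2,
        n_classes=n_classes,
        weights=weights,
        flip_y=0.05,          # controlled label noise
        random_state=seed,
    )
    return X, y
```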


2. Meta-Feature Extraction

Each dataset is summarized using dataset-level characteristics, including:

  • Number of samples
  • Number of features
  • Number of classes
  • Class imbalance ratio
  • Mean feature variance
  • Mean feature correlation
  • Label entropy

These features describe the nature of the dataset, not individual samples.
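A minimal extractor for these characteristics might look like the following sketch (function and field names are assumed; the project's actual implementation lives in `src/meta_features.py`):

```python
import numpy as np

def extract_meta_features(X, y):
    """Summarize a dataset (X, y) with dataset-level statistics."""
    n_samples, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    # Shannon entropy of the label distribution (bits)
    label_entropy = -np.sum(probs * np.log2(probs))
    # Mean absolute pairwise feature correlation (off-diagonal entries only)
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(n_features, dtype=bool)]
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": len(classes),
        "imbalance_ratio": float(counts.max() / counts.min()),
        "mean_feature_variance": float(X.var(axis=0).mean()),
        "mean_feature_correlation": float(np.abs(off_diag).mean()),
        "label_entropy": float(label_entropy),
    }
```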


3. Algorithm Evaluation

For each dataset, three candidate algorithms are evaluated:

  • Logistic Regression
  • Decision Tree
  • K-Nearest Neighbors

Each is scored with Stratified K-Fold cross-validation using the macro F1-score, which is appropriate for imbalanced and multiclass data.

The algorithm with the highest average F1 score is selected as the ground-truth best algorithm for that dataset.
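The evaluation loop could be sketched as follows. Candidate names and hyperparameters here are assumptions for illustration; the repository's version is in `src/evaluate_models.py`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

CANDIDATES = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

def best_algorithm(X, y, n_splits=5):
    """Return the candidate with the highest mean macro F1 under stratified CV."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
        for name, model in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores

# Example: find the ground-truth label for one synthetic dataset
X, y = make_classification(n_samples=300, n_classes=2, random_state=1)
winner, scores = best_algorithm(X, y)
```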


4. Meta-Model Training

A Decision Tree classifier is trained as a meta-model where:

  • Input → meta-features of a dataset
  • Output → best algorithm label

The decision tree provides interpretability and highlights how dataset properties influence algorithm choice.
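In sketch form, the meta-model is an ordinary scikit-learn decision tree fit on the meta-dataset. The meta-feature rows and labels below are invented for illustration, not results from the project:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["n_samples", "n_features", "n_classes", "label_entropy"]

# Illustrative meta-dataset: one row of meta-features per synthetic dataset
X_meta = np.array([
    [500,  10, 2, 0.97],
    [2000,  5, 3, 1.55],
    [300,  20, 2, 0.88],
    [1500,  8, 3, 1.50],
])
y_meta = np.array(["logreg", "tree", "knn", "tree"])  # observed best algorithms

meta_model = DecisionTreeClassifier(max_depth=3, random_state=0)
meta_model.fit(X_meta, y_meta)

# Interpretability: inspect the learned decision rules
print(export_text(meta_model, feature_names=FEATURES))

# Recommend an algorithm for an unseen dataset's meta-features
prediction = meta_model.predict([[400, 12, 2, 0.9]])
```

Because the meta-model sees only a handful of summary numbers per dataset, prediction is nearly free compared with training all candidate algorithms.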


5. Prediction on Unseen Datasets

Once trained, the meta-model predicts the best algorithm for new, unseen datasets, demonstrating generalization beyond the training data.


Results

  • The system successfully learns patterns linking dataset properties to algorithm performance.
  • Different datasets favor different algorithms depending on size, class structure, and feature relationships.
  • The end-to-end pipeline runs without manual intervention, from dataset generation to algorithm recommendation.

Achievements

  • Designed a complete meta-learning pipeline from scratch
  • Applied cross-validation and proper metrics for fair comparison
  • Handled imbalanced and multiclass classification
  • Built a professional, modular project structure
  • Gained practical understanding of:
    • Bias–variance trade-off
    • Algorithm suitability
    • Dataset-driven decision making

Limitations

  • The meta-dataset is relatively small, which can cause some algorithms to be underrepresented.
  • The meta-model may not predict rare optimal algorithms consistently.
  • Synthetic datasets may not capture all complexities of real-world data.
  • Hyperparameter tuning is intentionally minimal to keep focus on algorithm selection logic.

These limitations are expected and reflect real challenges in AutoML systems.


Future Improvements

  • Increase number and diversity of datasets
  • Add missing value simulation and imputation strategies
  • Include more algorithms (e.g., Random Forest, SVM)
  • Balance the meta-dataset across algorithm classes
  • Extend the system to regression problems

Key Learning

This project emphasizes that algorithm choice depends on data properties, not just model complexity.
By learning from datasets instead of samples, the system demonstrates a simplified but realistic view of how automated ML systems reason in practice.


Technologies Used

  • Python
  • NumPy
  • scikit-learn

Author

Built as part of hands-on learning after completing the Andrew Ng Machine Learning Specialization, with a focus on industry-relevant ML reasoning rather than tutorial-driven implementation.
