ORCA-python is an experimental framework built on Python that seamlessly integrates with scikit-learn and sacred modules to automate machine learning experiments through simple JSON configuration files. Initially designed for ordinal classification, it supports regular classification algorithms as long as they are compatible with scikit-learn, making it easy to run reproducible experiments across multiple datasets and classification methods.
ORCA-python requires Python 3.8 or higher and is tested on Python 3.8, 3.9, 3.10, and 3.11.
All dependencies are managed through `pyproject.toml` and include:
- numpy (>=1.24.4)
- pandas (>=2.0.3)
- sacred (>=0.8.7)
- scikit-learn (>=1.3.2)
- scipy (>=1.10.1)
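These constraints live in the project's `pyproject.toml`; a minimal sketch of how the dependencies table might look, assuming a standard PEP 621 layout (the repository's actual file may differ):

```toml
[project]
name = "orca-python"
requires-python = ">=3.8"
dependencies = [
    "numpy>=1.24.4",
    "pandas>=2.0.3",
    "sacred>=0.8.7",
    "scikit-learn>=1.3.2",
    "scipy>=1.10.1",
]
```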
- Clone the repository:

  ```bash
  git clone https://github.com/ayrna/orca-python
  cd orca-python
  ```

- Install the framework:

  ```bash
  pip install .
  ```

  For development purposes, use an editable installation:

  ```bash
  pip install -e .
  ```

  Optional dependencies for development:

  ```bash
  pip install -e .[dev]
  ```

  Note: the editable mode is required for running tests due to automatic dependency resolution.
Test your installation with the provided example:

```bash
python config.py with orca_python/configurations/full_functionality_test.json -l ERROR
```
ORCA-python includes sample datasets with pre-partitioned train/test splits using a 30-holdout experimental design.
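Assuming the partition files follow the ORCA naming convention of `train_<dataset>.<partition>` and `test_<dataset>.<partition>` (an assumption worth verifying against the bundled data folders), the on-disk layout looks roughly like:

```
orca_python/datasets/data/
├── balance-scale/
│   ├── train_balance-scale.0
│   ├── test_balance-scale.0
│   ├── train_balance-scale.1
│   └── ...
├── contact-lenses/
└── tae/
```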
Basic experiment configuration:

```json
{
  "general_conf": {
    "basedir": "orca_python/datasets/data",
    "datasets": ["balance-scale", "contact-lenses", "tae"],
    "hyperparam_cv_nfolds": 3,
    "output_folder": "results/",
    "metrics": ["ccr", "mae", "amae"],
    "cv_metric": "mae"
  },
  "configurations": {
    "SVM": {
      "classifier": "SVC",
      "parameters": {
        "C": [0.001, 0.1, 1, 10, 100],
        "gamma": [0.1, 1, 10]
      }
    },
    "SVMOP": {
      "classifier": "OrdinalDecomposition",
      "parameters": {
        "dtype": "ordered_partitions",
        "decision_method": "frank_hall",
        "base_classifier": "SVC",
        "parameters": {
          "C": [0.01, 0.1, 1, 10],
          "gamma": [0.01, 0.1, 1, 10],
          "probability": ["True"]
        }
      }
    }
  }
}
```
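In the SVMOP entry above, the outer keys (`dtype`, `decision_method`, `base_classifier`) configure the ordinal decomposition itself, while the nested `parameters` block holds the hyperparameter grid for the `base_classifier` (here, SVC); this nesting is how ensemble-style methods are configured.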
Run the experiment:

```bash
python config.py with my_experiment.json -l ERROR
```

Results are saved in the `results/` folder with performance metrics for each dataset-classifier combination. The framework automatically performs cross-validation, hyperparameter tuning, and evaluation on the test sets.
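Conceptually, each dataset-classifier pair is processed much like the following plain scikit-learn sketch (a simplification for illustration; the real pipeline also iterates over partitions, applies the configured preprocessing, and writes metrics):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data: ORCA-python would instead read one of the pre-partitioned
# train/test splits from basedir.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter grid taken from the "SVM" configuration above.
param_grid = {"C": [0.001, 0.1, 1, 10, 100], "gamma": [0.1, 1, 10]}

# Cross-validate with hyperparam_cv_nfolds folds, selecting by the
# cv_metric (MAE, negated because scikit-learn maximizes scores).
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# Evaluate the selected model on the held-out test partition.
y_pred = search.predict(X_test)
print(search.best_params_, mean_absolute_error(y_test, y_pred))
```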
Experiments are defined using JSON configuration files with two main sections: general_conf for experiment settings and configurations for classifier definitions.
The `general_conf` section controls global experiment parameters.
Required parameters:

- `basedir`: folder containing all dataset subfolders. Only one basedir may be specified at a time, and it can be given either as a full path or as a path relative to the framework folder.
- `datasets`: names of the datasets to experiment with. A subfolder with the same name must exist inside `basedir`.
Optional parameters:

- `hyperparam_cv_nfolds`: number of folds used while cross-validating.
- `jobs`: number of jobs used by GridSearchCV during cross-validation.
- `input_preprocessing`: preprocessing applied to the input data: `"std"` for standardization, `"norm"` for normalization, or `""` for none (see the sketch after this list).
- `output_folder`: name of the folder where all experiment results will be stored.
- `metrics`: names of the accuracy metrics used to measure the train and test performance of the classifiers.
- `cv_metric`: error measure used by GridSearchCV to find the best set of hyperparameters.
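As an illustration of the two preprocessing options, here is a minimal scikit-learn sketch, assuming `"std"` maps to z-score standardization and `"norm"` to min-max normalization (the mapping to `StandardScaler`/`MinMaxScaler` is an assumption, not taken from the framework's source):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 350.0]])

# "std": zero mean, unit variance per feature (fit on train only).
std = StandardScaler().fit(X_train)
X_train_std, X_test_std = std.transform(X_train), std.transform(X_test)

# "norm": rescale each feature to the [0, 1] range of the training data.
norm = MinMaxScaler().fit(X_train)
X_train_norm, X_test_norm = norm.transform(X_train), norm.transform(X_test)

print(X_test_std, X_test_norm)
```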
The `configurations` section defines the classifiers and their hyperparameters for GridSearchCV. Each configuration has a name and consists of:

- `classifier`: a scikit-learn or built-in ORCA-python classifier.
- `parameters`: hyperparameters for the grid search (nested for ensemble methods); the sketch below shows how such a grid expands into candidate models.
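To see how a `parameters` block translates into candidate models, scikit-learn's `ParameterGrid` enumerates the same cross-product that `GridSearchCV` evaluates; a minimal sketch using the `SVM` grid from the example above:

```python
from sklearn.model_selection import ParameterGrid

# Grid from the "SVM" configuration above: 5 values of C x 3 values of gamma.
grid = ParameterGrid({"C": [0.001, 0.1, 1, 10, 100], "gamma": [0.1, 1, 10]})
print(len(grid))  # 15 candidate hyperparameter combinations
for params in grid:
    print(params)  # e.g. {'C': 0.001, 'gamma': 0.1}
```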
To run an experiment:

```bash
python config.py with experiment_file.json
```
For reproducible results with minimal output:

```bash
python config.py with experiment_file.json seed=12345 -l ERROR
```

Parameters:

- `seed`: fixed random seed for reproducibility.
- `-l ERROR`: reduces Sacred framework verbosity.
Results are stored in the specified output folder with detailed performance metrics and hyperparameter information for each dataset and configuration combination.
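As a hedged post-processing example: if the summaries in the output folder are CSV files (the layout and glob pattern below are assumptions for illustration, not the framework's documented format), they can be aggregated with pandas:

```python
from pathlib import Path

import pandas as pd

# NOTE: the CSV layout and glob pattern below are assumptions for
# illustration; inspect your own output folder for the actual structure.
frames = [pd.read_csv(path) for path in Path("results").rglob("*.csv")]
if frames:
    summary = pd.concat(frames, ignore_index=True)
    print(summary.head())
```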