This project implements a robust machine learning pipeline to classify red wine quality using Random Forests, RFECV for feature selection, and nested cross-validation for model evaluation and hyperparameter tuning. The entire process uses the Wine Quality Dataset from the UCI Machine Learning Repository.
The objective is to build a binary classification model that predicts whether a red wine is of good quality (≥6) or not (<6), using Random Forests while optimizing for model robustness, feature selection, and generalization performance.
We use the winequality-red.csv dataset, which contains 1599 samples of red wine with 11 physicochemical input variables:
fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol
Target variable:
`quality` (integer score between 3 and 8)
The quality score is converted to a binary label:
- Quality ≥ 6 → Good wine (1)
- Quality < 6 → Bad wine (0)
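The thresholding rule above can be sketched in a few lines of pandas. This is an illustration rather than the project's own code: the helper name `binarize_quality` and the `good` column are assumptions, and the demo frame is a tiny stand-in for the real data.

```python
import pandas as pd


def binarize_quality(df: pd.DataFrame, threshold: int = 6) -> pd.Series:
    """Return 1 for quality >= threshold (good wine), 0 otherwise (bad wine)."""
    return (df["quality"] >= threshold).astype(int)


# Tiny in-memory stand-in for the real data. The UCI copy of the file is
# semicolon-separated, so it would be loaded with
# pd.read_csv("winequality-red.csv", sep=";").
demo = pd.DataFrame({"alcohol": [9.4, 10.5, 12.8, 9.8], "quality": [5, 6, 8, 3]})
demo["good"] = binarize_quality(demo)
print(demo["good"].tolist())  # [0, 1, 1, 0]
```

Keeping the threshold as a parameter makes it easy to check how sensitive the class balance is to the cut-off.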
Exploratory data analysis includes:
- Missing value check
- Descriptive statistics
- Class distribution visualization
- Feature-target bar plots
- Heatmap of correlation matrix
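The tabular inputs behind the checks listed above could be gathered roughly as follows. The function name `summarize` and the `good` label column are hypothetical, and the plotting itself (bar plots, heatmap) is left out because the plotting library used by the project is not stated.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "good") -> dict:
    """Collect the non-graphical parts of the exploratory checks."""
    features = df.drop(columns=[target])
    return {
        "missing": df.isna().sum(),                      # missing values per column
        "describe": features.describe(),                 # descriptive statistics
        "class_counts": df[target].value_counts(),       # class distribution
        "feature_means": features.groupby(df[target]).mean(),  # data for feature-target bar plots
        "corr": df.corr(),                               # data for the correlation heatmap
    }


# Small stand-in frame; the real pipeline would pass the full wine table.
demo = pd.DataFrame(
    {"alcohol": [9.4, 10.5, 12.8, 9.8], "pH": [3.5, 3.2, 3.3, 3.4], "good": [0, 1, 1, 0]}
)
report = summarize(demo)
print(report["class_counts"])
print(report["feature_means"])
```

Each entry of the returned dictionary maps onto one item of the list, so the plots can be drawn directly from these tables.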
We use a nested cross-validation strategy with Random Forests for both:
- Feature selection via `RFECV`
- Final prediction via `RandomForestClassifier`
RFECV feature selection:
- Uses a Random Forest to select the top `k` features from the 11 available.
- Evaluated using an internal 5-fold CV on the training folds.
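A minimal sketch of Random-Forest-driven RFECV with scikit-learn is shown here on synthetic data with 11 features. How the project fixes the value of `k` is not shown in the text, so this sketch simply lets `RFECV` choose the number of features by its internal 5-fold score. The tree count, scoring metric, and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in with 11 features, mirroring the width of the wine data.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=4, random_state=0
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,                                   # drop one feature per elimination round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",                       # assumed metric for ranking subsets
    min_features_to_select=1,
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("ranking (1 = selected):", selector.ranking_)

# Reduce the matrix to the selected columns for the downstream classifier.
X_selected = selector.transform(X)
```

In the nested setting, `X` and `y` here would be the training portion of one outer fold, never the full dataset.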
Hyperparameter tuning searches the following grid:

```python
# Candidate Random Forest settings explored during tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True]
}
```

- 5-fold stratified cross-validation is used to estimate generalization performance.
- Prevents data leakage between feature selection and model evaluation by performing feature selection within each training fold.
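The leakage-free arrangement described above can be sketched as a loop in which both feature selection and the grid search are fitted on the training part of each outer fold only. The sketch uses synthetic data, `GridSearchCV` as an assumed search tool, a 3-fold inner split, and a grid deliberately smaller than the one in the text so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(
    n_samples=150, n_features=11, n_informative=4, random_state=0
)

# Reduced grid (assumption) to keep the example fast.
small_grid = {"n_estimators": [20], "max_depth": [None, 5]}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Feature selection sees only the training part of this outer fold.
    selector = RFECV(RandomForestClassifier(n_estimators=20, random_state=0), cv=3)
    selector.fit(X_train, y_train)

    # The hyperparameter search also stays inside the training part.
    search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
    search.fit(selector.transform(X_train), y_train)

    # The held-out part is used once, for scoring only.
    y_pred = search.predict(selector.transform(X_test))
    fold_scores.append(accuracy_score(y_test, y_pred))
    print(f"fold {fold}: accuracy={fold_scores[-1]:.3f}, params={search.best_params_}")
```

Because the selector is refitted in every outer fold, the selected feature set may differ from fold to fold, which is one reason to record it per fold.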
For each fold and feature set size, the following metrics are computed:
- Accuracy
- Precision (weighted)
- Recall (weighted)
- F1 Score (weighted)
- ROC AUC
- Specificity (TNR)
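One way to compute the listed metrics for a single fold is sketched here with scikit-learn. The helper name `evaluate` and the toy arrays are illustrative. Specificity has no dedicated scikit-learn scorer, so it is derived from the confusion matrix as TN / (TN + FP).

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_pred, y_score) -> dict:
    """y_pred holds hard 0/1 labels; y_score holds the predicted probability of class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "AUC": roc_auc_score(y_true, y_score),
        "Specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy values for illustration only, not results from the wine data.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.6, 0.9, 0.7, 0.3])
metrics = evaluate(y_true, y_pred, y_score)
print(metrics)
```

Note that support-weighted recall is always equal to accuracy, so the two columns carry the same number. AUC is computed from probabilities rather than hard labels.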
Two Excel files are saved after evaluation:
- `train_results_RF_Wine.xlsx`: Metrics on training folds
- `test_results_RF_Wine.xlsx`: Metrics on test/validation folds
Each file contains the following columns: Top Features, Selected Features, Fold, Accuracy, Precision, Recall, F1 Score, AUC, Specificity, Best Parameters
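A possible way to assemble per-fold records into this column layout and write them out is sketched below. The helper names `results_frame` and `save_results` are hypothetical, and the example row contains placeholder values only, not measured results.

```python
import pandas as pd

COLUMNS = [
    "Top Features", "Selected Features", "Fold", "Accuracy", "Precision",
    "Recall", "F1 Score", "AUC", "Specificity", "Best Parameters",
]


def results_frame(rows: list) -> pd.DataFrame:
    """Arrange per-fold result dicts into the documented column order."""
    return pd.DataFrame(rows, columns=COLUMNS)


def save_results(rows: list, path: str) -> None:
    """Write the table to an Excel file (needs an engine such as openpyxl installed)."""
    results_frame(rows).to_excel(path, index=False)


# Placeholder record showing the shape of one row; all numbers are dummies.
example_row = {
    "Top Features": 3,
    "Selected Features": "alcohol, sulphates, volatile acidity",
    "Fold": 1,
    "Accuracy": 0.0, "Precision": 0.0, "Recall": 0.0,
    "F1 Score": 0.0, "AUC": 0.0, "Specificity": 0.0,
    "Best Parameters": str({"n_estimators": 100, "max_depth": None}),
}
frame = results_frame([example_row])
print(frame.columns.tolist())

# Usage, once real rows have been collected:
# save_results(test_rows, "test_results_RF_Wine.xlsx")
```

Storing the selected features and best parameters as strings keeps each fold on a single spreadsheet row.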
- No leakage via nested CV
- Hyperparameter optimization inside each outer fold
- Model interpretability through feature ranking
- Reproducibility with fixed random seeds