Binary classification of heart-disease risk on the Kaggle Heart Failure Prediction dataset (n=918), comparing six models from K-NN to a 2-hidden-layer Keras neural network.
Given 11 clinical features (age, chest-pain type, resting BP, cholesterol, max HR, ST-slope, etc.), predict whether a patient has heart disease (HeartDisease ∈ {0, 1}). The project explores how a small, structured medical tabular dataset is best modeled — comparing classic baselines (K-NN, Decision Tree), tree ensembles (Random Forest, Gradient Boosting, XGBoost), and a neural network — to see whether the deep model is worth the extra complexity.
The table reports test accuracy and AUC-ROC on a held-out 20% split (~184 samples). All non-NN models were tuned with GridSearchCV; the NN was tuned over L2 regularization and dropout.
| Model | Test Accuracy | AUC-ROC |
|---|---|---|
| Decision Tree | 81.88% | 0.864 |
| Gradient Boosting | 83.33% | 0.910 |
| XGBoost | 83.33% | 0.895 |
| K-NN | 84.06% | 0.888 |
| Random Forest | 84.78% | 0.905 |
| Neural Net (2 hidden layers, Keras) | 88.59% | — |
The 2-layer Keras NN with StandardScaler + L2 regularization + dropout is the top scorer at 88.59% accuracy and a 0.88 weighted F1 on the hold-out set. Among the grid-searched non-NN models, Random Forest (84.78%) and Gradient Boosting (AUC 0.910) lead.
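
As a rough gauge of how much weight the accuracy gaps above can bear, the sketch below computes a normal-approximation 95% confidence interval for a test accuracy. It assumes the hold-out has 184 samples, as stated above, and treats each prediction as an independent Bernoulli trial. The helper name `accuracy_ci` is illustrative and not part of the notebook.

```python
import math


def accuracy_ci(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for an accuracy.

    accuracy: observed fraction of correct predictions, in [0, 1]
    n:        number of test samples (assumed independent)
    z:        normal quantile; 1.96 gives roughly a 95% interval
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# Assumption: n = 184 hold-out samples, taken from the text above.
N_TEST = 184
nn_low, nn_high = accuracy_ci(0.8859, N_TEST)
rf_low, rf_high = accuracy_ci(0.8478, N_TEST)

print(f"Neural Net:    [{nn_low:.3f}, {nn_high:.3f}]")
print(f"Random Forest: [{rf_low:.3f}, {rf_high:.3f}]")
# The intervals overlap when the NN lower bound sits below the RF upper bound.
print("Intervals overlap:", nn_low < rf_high)
```

Each interval is roughly ±5 percentage points wide at this sample size, so the two overlap. Overlapping intervals are only a heuristic, not a formal test, but they are consistent with reading the ranking cautiously.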
- Load: Kaggle Heart Failure Prediction Dataset via `kagglehub` (918 rows × 12 columns).
- EDA: distributions, descriptive statistics, missingness check (none).
- Clean:
  - Drop the single `RestingBP = 0` row (clinically impossible).
  - Impute `Cholesterol = 0` (172 rows, also clinically impossible) with the median, conditioned on `HeartDisease`.
- Features: one-hot encode categoricals, inspect correlations, rank with Random-Forest `feature_importances_`, sanity-check with Partial Dependence Plots.
- Model:
  - K-NN with `MinMaxScaler` + `GridSearchCV` over `k`.
  - Decision Tree, Random Forest, Gradient Boosting, XGBoost, each tuned via `GridSearchCV` over depth / leaves / learning rate / n_estimators.
  - MLPClassifier (two hidden layers, 100 + 50).
  - Keras Sequential: 2 hidden layers + `BatchNormalization` + `Dropout`, with a small grid over L2 (`1e-4`, `1e-3`, `1e-2`) and dropout (`0.2`, `0.3`).
- Evaluate: accuracy, MSE, AUC-ROC, sensitivity, specificity, precision, F1 on a single 80/20 hold-out.
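
The K-NN and Evaluate steps above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's code: the grid of `k` values, the 5-fold CV, the random seeds, and the helper name `binary_metrics` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption: 918 rows, 11 features).
X, y = make_classification(
    n_samples=918, n_features=11, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline, so each CV fold fits the scaler
# on its own training portion only.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}  # hypothetical grid
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)


def binary_metrics(y_true, y_pred):
    """Compute the hold-out metrics listed above from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # For 0/1 labels and 0/1 predictions, MSE is the error rate.
        "mse": float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


metrics = binary_metrics(y_test, search.predict(X_test))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("best k:", search.best_params_["knn__n_neighbors"])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc_roc: {auc:.3f}")
```

Because the labels are binary, MSE carries no information beyond accuracy (it equals one minus accuracy), which is why the sketch computes it only for completeness.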
- Language: Python 3.10+
- ML / stats: scikit-learn, XGBoost, imbalanced-learn
- Deep learning: TensorFlow / Keras
- Data: pandas, NumPy, kagglehub
- Plots: matplotlib, seaborn, tabulate
Predicting-Heart-Disease/
├── Predicting_Heart_Disease_1.ipynb # End-to-end notebook (EDA → cleaning → 6 models)
├── assets/
│ ├── model_comparison.png # Test-accuracy bar chart (used in this README)
│ └── generate_charts.py # Reproducible chart from notebook results
└── README.md

    pip install pandas scikit-learn matplotlib seaborn kagglehub tensorflow imbalanced-learn xgboost tabulate
    jupyter notebook Predicting_Heart_Disease_1.ipynb

The notebook downloads the dataset via kagglehub (you may need Kaggle credentials configured).
To regenerate the README chart from the recorded numbers:

    python assets/generate_charts.py

- Small dataset (n=918). A single random 80/20 split is fragile for ranking models that are within ~3 points of each other. The NN's 4-point edge over Random Forest is suggestive, not statistically significant; repeated k-fold CV would tell a stronger story.
- No clinical validation. This is an academic notebook on a public dataset, not a deployed or peer-reviewed clinical tool. Don't use it for actual diagnosis.
- Class imbalance is mild but unaddressed. Roughly 55% positive / 45% negative on the hold-out; `imbalanced-learn` is in `requirements` but isn't actually used in the final pipeline.
- The MLPClassifier saw two runs in the notebook (first on un-scaled features, second after standardization). Only the standardized variant is competitive. The chart cites the Keras 2-layer NN, which is the strongest deep variant.
- Cholesterol=0 imputation is not leak-free. Imputing with the median, conditioned on `HeartDisease`, leaks information from the target into training features; a clean re-run would impute on the train set only and apply that constant to test.
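
One way to realize the train-only imputation described in the last point, together with the repeated k-fold evaluation suggested under "Small dataset", is to put the imputer inside a scikit-learn `Pipeline` so that every fold computes its median from its own training rows. The sketch below uses a small synthetic frame rather than the real data. The column subset, the Random Forest settings, and the 5-fold, 3-repeat scheme are assumptions, and it swaps the notebook's target-conditioned median for a plain training-set median.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in (assumption): three numeric columns named after
# features of the dataset, with a target loosely driven by Age.
df = pd.DataFrame({
    "Age": rng.integers(30, 78, size=n),
    "MaxHR": rng.integers(70, 200, size=n),
    "Cholesterol": rng.normal(240, 50, size=n).round(),
})
y = (df["Age"] + rng.normal(0, 10, size=n) > 55).astype(int)

# Mimic the recorded-as-zero cholesterol values in about a fifth of the rows.
df.loc[rng.random(n) < 0.2, "Cholesterol"] = 0

# Treat 0 as missing for Cholesterol only; other columns pass through.
# The median is learned in fit(), i.e. from the training rows of each fold.
preprocess = ColumnTransformer(
    [("chol", SimpleImputer(missing_values=0, strategy="median"), ["Cholesterol"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# 5 folds repeated 3 times gives 15 accuracy estimates instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, df, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Reporting the mean and spread over the repeated folds gives a direct sense of how much a single 80/20 split could move the ranking.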
