Predicting Heart Disease

Binary classification of heart-disease risk on the Kaggle Heart Failure Prediction dataset (n=918), comparing six models from K-NN to a 2-hidden-layer Keras neural network.

🎯 Objective

Given 11 clinical features (age, chest-pain type, resting BP, cholesterol, max HR, ST-slope, etc.), predict whether a patient has heart disease (HeartDisease ∈ {0, 1}). The project explores how a small, structured medical tabular dataset is best modeled — comparing classic baselines (K-NN, Decision Tree), tree ensembles (Random Forest, Gradient Boosting, XGBoost), and a neural network — to see whether the deep model is worth the extra complexity.

📊 Results

Test accuracy on a held-out 20% split (~184 samples). All non-NN models tuned with GridSearchCV; NN tuned over L2 regularization and dropout.

Model	Test Accuracy	AUC-ROC
Decision Tree	81.88%	0.864
Gradient Boosting	83.33%	0.910
XGBoost	83.33%	0.895
K-NN	84.06%	0.888
Random Forest	84.78%	0.905
Neural Net (2 hidden layers, Keras)	88.59%	—

The 2-layer Keras NN with StandardScaler + L2 regularization + dropout is the top scorer at 88.59% accuracy and a 0.88 weighted F1 on the hold-out set. Among the grid-searched non-NN models, Random Forest (84.78%) and Gradient Boosting (AUC 0.910) lead.
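As a rough check on how much weight the accuracy gaps in the table can bear, the sketch below computes an approximate 95% Wilson score interval for two of the reported accuracies. It is not part of the notebook. The test-set size `n_test = 184` is an assumption taken from the "~184 samples" figure above, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion observed on n samples."""
    denom = 1.0 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Accuracies copied from the results table; n_test is an assumption (see lead-in).
n_test = 184
reported = {"Random Forest": 0.8478, "Neural Net (Keras)": 0.8859}
intervals = {name: wilson_interval(acc, n_test) for name, acc in reported.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {reported[name]:.2%} (approx. 95% CI {low:.1%} to {high:.1%})")
```

Under these assumptions the two intervals overlap. This treats each model independently and is not a paired test, so it only indicates the scale of the sampling noise on a hold-out of this size.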

🏗️ Methodology

Load: Kaggle Heart Failure Prediction Dataset via kagglehub — 918 rows × 12 columns.
EDA: distributions, descriptive statistics, missingness check (none).
Clean:
- Drop the single RestingBP = 0 row (clinically impossible).
- Impute Cholesterol = 0 (172 rows, also clinically impossible) with the median, conditioned on HeartDisease.
Features: one-hot encode categoricals, inspect correlations, rank with Random-Forest feature_importances_, sanity-check with Partial Dependence Plots.
Model:
- K-NN with MinMaxScaler + GridSearchCV over k.
- Decision Tree, Random Forest, Gradient Boosting, XGBoost — each tuned via GridSearchCV over depth / leaves / learning rate / n_estimators.
- MLPClassifier (two hidden layers, 100 + 50).
- Keras Sequential: 2 hidden layers + BatchNormalization + Dropout, with a small grid over L2 (1e-4, 1e-3, 1e-2) and dropout (0.2, 0.3).
Evaluate: accuracy, MSE, AUC-ROC, sensitivity, specificity, precision, F1 on a single 80/20 hold-out.
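The K-NN and Evaluate steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the data is a synthetic stand-in generated with `make_classification`, the grid over k is hypothetical, and the stratified split and `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cleaned, one-hot-encoded feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so each CV fold fits the scaler on its own training part.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}  # hypothetical grid over k
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the refit best model once on the hold-out split.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

metrics = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "sensitivity": tp / (tp + fn),  # recall on the positive class
    "specificity": tn / (tn + fp),  # recall on the negative class
    "precision": tp / (tp + fp),
    "auc_roc": roc_auc_score(y_test, y_prob),
}
print(search.best_params_, metrics)
```

Keeping the scaler inside the `Pipeline` means the grid search never sees statistics from its validation folds. The same pattern should carry over to the tree models by swapping the estimator and the parameter grid.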

🛠️ Tech Stack

Language: Python 3.10+
ML / stats: scikit-learn, XGBoost, imbalanced-learn
Deep learning: TensorFlow / Keras
Data: pandas, NumPy, kagglehub
Plots: matplotlib, seaborn, tabulate

📁 Repository Structure

Predicting-Heart-Disease/
├── Predicting_Heart_Disease_1.ipynb   # End-to-end notebook (EDA → cleaning → 6 models)
├── assets/
│   ├── model_comparison.png            # Test-accuracy bar chart (used in this README)
│   └── generate_charts.py              # Reproducible chart from notebook results
└── README.md

🚀 How to Run

pip install pandas scikit-learn matplotlib seaborn kagglehub tensorflow imbalanced-learn xgboost tabulate notebook
jupyter notebook Predicting_Heart_Disease_1.ipynb

The notebook downloads the dataset via kagglehub (you may need Kaggle credentials configured).

To regenerate the README chart from the recorded numbers:

python assets/generate_charts.py

📝 Notes / Limitations

Small dataset (n=918). A single random 80/20 split is fragile for ranking models that are within ~3 points of each other. The NN's roughly 4-point edge over Random Forest is suggestive, but it was not tested for statistical significance, and on ~184 hold-out samples a gap of that size can plausibly come from sampling noise. Repeated k-fold CV would tell a stronger story.
No clinical validation. This is an academic notebook on a public dataset, not a deployed or peer-reviewed clinical tool. Don't use it for actual diagnosis.
Class imbalance is mild but unaddressed. Roughly 55% positive / 45% negative on the hold-out; imbalanced-learn appears in the tech stack and the install command but isn't actually used in the final pipeline.
The MLPClassifier saw two runs in the notebook (first on un-scaled features, second after standardization). Only the standardized variant is competitive. The chart cites the Keras 2-layer NN, which is the strongest deep variant.
Cholesterol = 0 imputation leaks the target. Imputing with the median conditioned on HeartDisease passes label information into the features, and the label is not available for a new patient at prediction time, so the reported scores are likely optimistic. A clean re-run would compute the median on the train set only and apply that constant to test.
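A train-only imputation along these lines might look like the sketch below. It is an illustration under stated assumptions, not the notebook's code: the toy DataFrame values are made up, the 70/30 split and `random_state` are arbitrary, and the sketch deliberately does not use HeartDisease when computing the median.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (values are made up for illustration).
df = pd.DataFrame({
    "Cholesterol": [0.0, 180.0, 200.0, 0.0, 240.0, 260.0, 0.0, 220.0, 190.0, 210.0],
    "HeartDisease": [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
})

# Split first, so no statistic is computed on rows that end up in the test set.
train, test = train_test_split(df, test_size=0.3, random_state=0)
train, test = train.copy(), test.copy()

# Median of the valid (non-zero) training values only; the label is not used.
train_median = train.loc[train["Cholesterol"] > 0, "Cholesterol"].median()

# Apply the same training-derived constant to both splits.
for part in (train, test):
    is_missing = part["Cholesterol"] == 0
    part["Cholesterol"] = part["Cholesterol"].mask(is_missing, train_median)

print(train_median)
```

scikit-learn's `SimpleImputer(missing_values=0, strategy="median")` fitted on the training split is another way to get the same behaviour inside a pipeline.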
