CrSamson/Predicting-Heart-Disease

Predicting Heart Disease

Binary classification of heart-disease risk on the Kaggle Heart Failure Prediction dataset (n=918), comparing six models from K-NN to a 2-hidden-layer Keras neural network.

🎯 Objective

Given 11 clinical features (age, chest-pain type, resting BP, cholesterol, max HR, ST-slope, etc.), predict whether a patient has heart disease (HeartDisease ∈ {0, 1}). The project explores how a small, structured medical tabular dataset is best modeled — comparing classic baselines (K-NN, Decision Tree), tree ensembles (Random Forest, Gradient Boosting, XGBoost), and a neural network — to see whether the deep model is worth the extra complexity.

📊 Results

Test accuracy on a held-out 20% split (~184 samples). All non-NN models tuned with GridSearchCV; NN tuned over L2 regularization and dropout.

Model comparison

| Model | Test Accuracy | AUC-ROC |
| --- | --- | --- |
| Decision Tree | 81.88% | 0.864 |
| Gradient Boosting | 83.33% | 0.910 |
| XGBoost | 83.33% | 0.895 |
| K-NN | 84.06% | 0.888 |
| Random Forest | 84.78% | 0.905 |
| Neural Net (2 hidden layers, Keras) | 88.59% | (not reported) |

The 2-layer Keras NN (StandardScaler + L2 regularization + dropout) is the top scorer, with 88.59% accuracy and a weighted F1 of 0.88 on the hold-out set. Among the grid-searched non-NN models, Random Forest has the best accuracy (84.78%) and Gradient Boosting the best AUC-ROC (0.910).
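An accuracy measured on n test samples is always a whole number of correct predictions divided by n, which gives a quick sanity check on the reported figures. The sketch below is only an arithmetic check, not part of the notebook: the helper `consistent_with` and the candidate sizes (184 ≈ 20% and 138 ≈ 15% of the 917 cleaned rows) are assumptions made for illustration.

```python
def consistent_with(pct, n, tol=0.005):
    """True if `pct` (percent, rounded to 2 decimals) equals k/n for some integer k."""
    k = round(pct / 100 * n)  # nearest whole number of correct predictions
    return abs(100 * k / n - pct) <= tol


# Accuracies as reported in the model comparison.
reported = {
    "Decision Tree": 81.88,
    "Gradient Boosting": 83.33,
    "XGBoost": 83.33,
    "K-NN": 84.06,
    "Random Forest": 84.78,
    "Neural Net (Keras)": 88.59,
}

# Candidate test-set sizes (assumptions, see the lead-in).
candidate_sizes = (138, 184)

matches = {
    name: [n for n in candidate_sizes if consistent_with(pct, n)]
    for name, pct in reported.items()
}

for name, sizes in matches.items():
    print(f"{name}: consistent with n in {sizes}")
```

With these inputs, only the Random Forest and neural-net figures are consistent with 184 test samples; the other four fit 138. That may mean the classical models were scored on a differently sized split than the NN, but the notebook itself is the authority on how each model was evaluated.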

🏗️ Methodology

  1. Load: Kaggle Heart Failure Prediction Dataset via kagglehub — 918 rows × 12 columns.
  2. EDA: distributions, descriptive statistics, missingness check (none).
  3. Clean:
    • Drop the single RestingBP = 0 row (clinically impossible).
    • Impute Cholesterol = 0 (172 rows, also clinically impossible) with the median, conditioned on HeartDisease.
  4. Features: one-hot encode categoricals, inspect correlations, rank with Random-Forest feature_importances_, sanity-check with Partial Dependence Plots.
  5. Model:
    • K-NN with MinMaxScaler + GridSearchCV over k.
    • Decision Tree, Random Forest, Gradient Boosting, XGBoost — each tuned via GridSearchCV over depth / leaves / learning rate / n_estimators.
    • MLPClassifier (two hidden layers, 100 + 50).
    • Keras Sequential: 2 hidden layers + BatchNormalization + Dropout, with a small grid over L2 (1e-4, 1e-3, 1e-2) and dropout (0.2, 0.3).
  6. Evaluate: accuracy, MSE, AUC-ROC, sensitivity, specificity, precision, F1 on a single 80/20 hold-out.
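Steps 3 to 6 can be sketched end to end for the K-NN branch. This is a minimal illustration, not the notebook's code: it runs on synthetic data (an assumption, so that it needs no download), treats column 0 as a stand-in for Cholesterol, and places the zero-imputation inside the pipeline so the median is learned from training data only, which differs from the target-conditioned imputation described in step 3.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (assumption): 3 numeric features,
# column 0 plays the role of Cholesterol with zeros marking impossible readings.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 3)) + 5.0
X[rng.random(n) < 0.2, 0] = 0.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Impute zeros in column 0 only; other columns pass through unchanged.
impute_chol = ColumnTransformer(
    [("chol", SimpleImputer(missing_values=0.0, strategy="median"), [0])],
    remainder="passthrough",
)

pipe = Pipeline([
    ("impute", impute_chol),   # median is fit on the training folds only
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Small grid over k, mirroring "GridSearchCV over k" (grid values are illustrative).
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9, 11]}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Hold-out metrics derived from the confusion matrix.
y_pred = search.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall on the positive class
specificity = tn / (tn + fp)   # recall on the negative class

print(search.best_params_, round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```

Because the imputer and scaler are pipeline steps, GridSearchCV refits them on each training fold, so no statistics from validation or test rows reach the model.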

🛠️ Tech Stack

  • Language: Python 3.10+
  • ML / stats: scikit-learn, XGBoost, imbalanced-learn
  • Deep learning: TensorFlow / Keras
  • Data: pandas, NumPy, kagglehub
  • Plots / reporting: matplotlib, seaborn, tabulate

📁 Repository Structure

Predicting-Heart-Disease/
├── Predicting_Heart_Disease_1.ipynb   # End-to-end notebook (EDA → cleaning → 6 models)
├── assets/
│   ├── model_comparison.png            # Test-accuracy bar chart (used in this README)
│   └── generate_charts.py              # Reproducible chart from notebook results
└── README.md

🚀 How to Run

pip install pandas scikit-learn matplotlib seaborn kagglehub tensorflow imbalanced-learn xgboost tabulate
jupyter notebook Predicting_Heart_Disease_1.ipynb

The notebook downloads the dataset via kagglehub (you may need Kaggle credentials configured).

To regenerate the README chart from the recorded numbers:

python assets/generate_charts.py

📝 Notes / Limitations

  • Small dataset (n=918). A single random 80/20 split is fragile for ranking models that are within ~3 points of each other. The NN's roughly 4-point edge over Random Forest (88.59% vs. 84.78%) is suggestive, not statistically significant; repeated k-fold CV would tell a stronger story.
  • No clinical validation. This is an academic notebook on a public dataset, not a deployed or peer-reviewed clinical tool. Don't use it for actual diagnosis.
  • Class imbalance is mild but unaddressed. Roughly 55% positive / 45% negative on the hold-out; imbalanced-learn appears in the tech stack and the install command but isn't actually used in the final pipeline.
  • The MLPClassifier saw two runs in the notebook (first on un-scaled features, second after standardization). Only the standardized variant is competitive. The chart cites the Keras 2-layer NN, which is the strongest deep variant.
  • Cholesterol=0 imputation leaks the target. Imputing with the median, conditioned on HeartDisease, feeds label information into the features, and that label is not available at prediction time. A clean re-run would compute the median on the train set only and apply that constant to test.
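To put a number on the first limitation, the sampling uncertainty of a single hold-out accuracy can be approximated with a binomial standard error. The sketch below uses the normal approximation (an assumption; a Wilson interval or repeated cross-validation would be more careful), and `accuracy_interval` is a hypothetical helper, not something from the notebook.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1.0 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se


# Reported neural-net accuracy on the ~184-sample hold-out.
nn_low, nn_high = accuracy_interval(0.8859, 184)
rf_accuracy = 0.8478  # reported Random Forest accuracy

print(f"NN 95% interval: {nn_low:.3f} to {nn_high:.3f}")
print("Random Forest inside NN interval:", nn_low < rf_accuracy < nn_high)
```

At this sample size the interval is roughly ±4.6 points wide on each side, so the Random Forest accuracy falls inside it, which is why the ranking should be read as suggestive rather than settled.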

About

Machine learning models for predicting heart disease from clinical data. Compares K-NN, decision trees, Random Forest, Gradient Boosting, XGBoost, and neural networks.
