ANUBprad/Cardiovascular-Risk-Assessment-Using-Statistical-Analysis-and-ML

❤️ Cardiovascular Risk Assessment Using Statistical Analysis & Machine Learning

🚑 An end-to-end Data Science project focused on early cardiovascular disease risk identification using statistical validation and supervised machine learning.


⚖️ Disclaimer

This project is for educational and analytical purposes only and does not constitute medical advice.


📌 Overview

Cardiovascular disease is a leading cause of global mortality.
This project applies data science principles to structured healthcare data to identify key risk factors and deliver probability-based risk stratification for decision support.

Key highlights:

  • 📊 Statistical validation of risk factors
  • 🧠 Interpretable machine learning models
  • 🎯 Recall-focused evaluation for medical screening
  • 🌐 Deployment via Streamlit web application

❓ Problem Statement

Early detection of cardiovascular risk enables preventive intervention and improved outcomes.
The objective of this project is to:

  • Analyze clinical and lifestyle data
  • Validate predictors using statistical hypothesis testing
  • Build predictive models with medical-context evaluation
  • Translate predictions into actionable risk categories

🗂️ Dataset

  • 📁 Structured healthcare dataset with 4,000+ patient records
  • 🎯 Target variable: TenYearCHD (10-year risk of coronary heart disease, CHD)
  • 🧬 Features include demographics, medical history, and physiological measurements

Raw data is preserved and all preprocessing steps are fully reproducible.


📊 Exploratory Data Analysis (EDA)

Key insights from EDA:

  • 📈 Age and systolic blood pressure strongly correlate with disease risk
  • 🧪 Cholesterol shows higher variability among high-risk individuals
  • 🚬 Lifestyle factors such as smoking contribute meaningfully
  • 🔗 Multiple features exhibit moderate to strong correlation with the target

Detailed insights are available in:

  1. reports/eda_insights.md
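
The correlation check behind the insights above could look roughly like the following. This is a minimal sketch on synthetic data, not the project's actual EDA code: the column names `age` and `sysBP` and the generated values are assumptions made for illustration, and only `TenYearCHD` comes from the dataset description.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: values and the
# feature column names are illustrative only).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(30, 70, size=n)
sys_bp = 100 + 0.8 * age + rng.normal(0, 10, size=n)

# In this toy data the outcome probability rises with age so that the
# correlation is visible.
prob = 1 / (1 + np.exp(-(age - 55) / 5))
target = rng.binomial(1, prob)

df = pd.DataFrame({"age": age, "sysBP": sys_bp, "TenYearCHD": target})

# Pearson correlation of each feature with the binary target
# (equivalent to the point-biserial correlation).
target_corr = (
    df.corr()["TenYearCHD"]
    .drop("TenYearCHD")
    .sort_values(ascending=False)
)
print(target_corr)
```

Correlation with a binary target only captures linear association, so it works best as a first screen ahead of formal testing.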

📐 Statistical Hypothesis Testing

To validate the EDA findings, formal statistical tests were conducted:

  • ✔ Two-sample t-tests for approximately normally distributed variables
  • ✔ Mann–Whitney U tests for variables that do not meet the normality assumption

The results confirm statistically significant differences in:

  • Age
  • Cholesterol
  • Heart rate

Detailed results are available in:

  1. reports/statistical_test_result.md
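
Both tests are available in SciPy. The sketch below runs them on synthetic age samples for two hypothetical outcome groups. The group sizes, means, and the 0.05 significance level are assumptions, and Welch's variant of the t-test (`equal_var=False`) is an illustrative choice rather than the project's confirmed setting.

```python
import numpy as np
from scipy import stats

# Synthetic age samples for two hypothetical outcome groups (assumption).
rng = np.random.default_rng(42)
age_no_chd = rng.normal(48, 8, size=300)
age_chd = rng.normal(55, 8, size=100)

# Welch's two-sample t-test: does not assume equal group variances.
t_stat, t_p = stats.ttest_ind(age_chd, age_no_chd, equal_var=False)

# Mann-Whitney U test: rank-based alternative with no normality assumption.
u_stat, u_p = stats.mannwhitneyu(age_chd, age_no_chd, alternative="two-sided")

alpha = 0.05  # assumed significance level
print(f"t-test:         t={t_stat:.2f}, p={t_p:.3g}, significant={t_p < alpha}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={u_p:.3g}, significant={u_p < alpha}")
```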

🧪 Feature Engineering

Feature engineering steps included:

  • Handling missing values using median imputation
  • Creating age-based buckets to capture non-linear risk
  • Engineering interaction terms (e.g., age × cholesterol)
  • Encoding categorical variables for modeling consistency
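
The steps above can be sketched with pandas as follows. The tiny DataFrame, the column names (`age`, `totChol`, `sex`), and the bucket edges are all assumptions for illustration; the project's real columns and cut points may differ.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (assumption: column names and values are made up).
df = pd.DataFrame({
    "age": [39, 46, 61, 52, 70],
    "totChol": [195.0, np.nan, 225.0, 285.0, 240.0],
    "sex": ["F", "M", "M", "F", "M"],
})

# 1. Median imputation for missing values.
df["totChol"] = df["totChol"].fillna(df["totChol"].median())

# 2. Age buckets (assumed edges; intervals are right-inclusive by default).
df["age_bucket"] = pd.cut(
    df["age"],
    bins=[0, 40, 50, 60, 120],
    labels=["<=40", "41-50", "51-60", "61+"],
)

# 3. Interaction term: age x cholesterol.
df["age_x_chol"] = df["age"] * df["totChol"]

# 4. One-hot encode categorical columns, dropping the first level of each.
df = pd.get_dummies(df, columns=["sex", "age_bucket"], drop_first=True)
print(df)
```

In a real pipeline the median would be computed on the training split only and reused for the test split, so that no information leaks from the held-out data.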

🤖 Modeling Approach

Supervised learning models trained and evaluated:

  • Logistic Regression (interpretable baseline)
  • Random Forest
  • Gradient Boosting
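
A minimal scikit-learn sketch of training these three model families side by side is shown below. It uses a synthetic imbalanced dataset from `make_classification` in place of the real data, and the hyperparameters (including `class_weight="balanced"`) are assumptions, not the project's confirmed configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the patient data (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Scaling helps the linear model converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and keep the predicted probability of the positive class.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
test_proba = {
    name: model.predict_proba(X_test)[:, 1] for name, model in fitted.items()
}
```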

📏 Evaluation Metrics

Given the medical screening context:

  • 🔁 Recall was prioritized to minimize false negatives
  • 📊 Precision, ROC-AUC, and the confusion matrix were also reported
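
To illustrate why recall drives the evaluation, the sketch below computes these metrics at two decision thresholds on a small hand-made example. The labels, probabilities, thresholds, and the helper `evaluate` are hypothetical. Lowering the threshold catches more true cases (higher recall) at the cost of more false alarms (lower precision).

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: screening metrics at a given decision threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),  # threshold-independent
        "confusion_matrix": confusion_matrix(y_true, y_pred),  # [[TN, FP], [FN, TP]]
    }

# Hand-made labels and predicted probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.35, 0.6, 0.8, 0.45, 0.3, 0.05, 0.9, 0.4]

at_default = evaluate(y_true, y_prob, threshold=0.5)
at_lower = evaluate(y_true, y_prob, threshold=0.25)

print("threshold 0.50:", at_default["recall"], round(at_default["precision"], 3))
print("threshold 0.25:", at_lower["recall"], round(at_lower["precision"], 3))
```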

🔍 Model Interpretability

  • 📉 Logistic Regression coefficients analyzed for feature influence
  • 🌳 Tree-based feature importance used to identify dominant predictors

Consistently strong indicators included age, blood pressure, cholesterol, and smoking-related features.
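
Both views of feature influence can be extracted as in the following sketch. The data is synthetic and the feature names are placeholders, so the rankings it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder feature names attached to synthetic data (assumption).
feature_names = ["age", "sysBP", "totChol", "cigsPerDay"]
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

log_reg = LogisticRegression(max_iter=1000).fit(X_scaled, y)
coefs = pd.Series(log_reg.coef_[0], index=feature_names)
# Order by absolute size; the sign gives the direction of the effect.
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_scaled, y)
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(coefs)
print(importances)
```

Impurity-based importances are normalized to sum to 1 and carry no sign, so they complement the signed logistic coefficients without replacing them.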


🧭 Risk Stratification

Predicted probabilities were converted into three risk categories:

  • 🟢 Low Risk
  • 🟡 Medium Risk
  • 🔴 High Risk

This enables actionable decision support beyond raw probability scores.
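
The mapping from probability to category can be sketched as a simple threshold function. The function name `risk_category` and the cutoffs (0.10 and 0.20) are assumptions for illustration; the source does not state the project's actual thresholds.

```python
def risk_category(probability, low_cutoff=0.10, high_cutoff=0.20):
    """Map a predicted probability to a risk label.

    The cutoffs are illustrative assumptions, not the project's values.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < low_cutoff:
        return "Low Risk"
    if probability < high_cutoff:
        return "Medium Risk"
    return "High Risk"

for p in (0.05, 0.15, 0.50):
    print(f"{p:.2f} -> {risk_category(p)}")
```

In a screening setting, cutoffs like these would normally be chosen from validation data to meet a target recall rather than fixed in advance.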


🛠️ Tech Stack

  • Python
  • Pandas, NumPy
  • Scikit-learn, SciPy
  • Streamlit

🌐 Deployment

A Streamlit-based web application enables:

  • User-friendly health input collection
  • Real-time risk prediction
  • Probability-based risk categorization

This bridges the gap between modeling and real-world usability.

🚀 Future Enhancements

  • Integrate SHAP for instance-level explainability
  • Hyperparameter tuning with cross-validation
  • Expand dataset diversity
  • Add longitudinal risk tracking

👤 Author

Anubhab Pradhan
🔗 GitHub: https://github.com/ANUBprad
