Healthcare cost prediction and analysis using Python and machine learning
This project focuses on analysing individual-level healthcare cost data to identify high-cost individuals using data analytics and machine learning techniques.
The objective is to explore key drivers of healthcare costs and build predictive models that support cost control and decision-making.
The dataset contains individual-level healthcare and demographic information, including medical visits, health conditions, and associated costs.
Publicly available healthcare cost data is used for educational and analytical purposes.
The project follows an end-to-end data analysis workflow:
- Data cleaning and preprocessing using Python (Pandas, NumPy)
- Exploratory Data Analysis (EDA) to understand cost distributions and key variables
- Feature engineering to construct meaningful predictors
- Model development using regression and tree-based models
- Model evaluation and interpretation
Tools and technologies used:
- Python (Pandas, NumPy, scikit-learn)
- SQL (exploratory querying)
- Jupyter Notebook
- Visual Studio Code
- Git
Project structure:

healthcare-cost-analysis/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_modeling.ipynb
├── src/
├── sql/
├── results/
└── README.md
Key outcomes:
- Identification of factors associated with high healthcare costs
- Comparison of regression and tree-based models
- Actionable insights for healthcare cost analysis
Future work:
- Hyperparameter tuning and model optimisation
- Incorporation of additional socioeconomic features
- Extension to time-based or longitudinal analysis
Feature Engineering
- Applied binary encoding for gender and smoking status to preserve interpretability
- Used one-hot encoding for geographic regions to avoid imposing ordinal structure
- Engineered interaction features to capture non-linear cost drivers
- Log-transformed the target variable to reduce skewness and stabilise variance
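The encodings above can be sketched as follows. The raw column names (`sex`, `smoker`, `region`, `charges`) are assumptions about the input schema; the engineered names (`bmi_smoker`, `log_charges`) match those used later in this README.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the encodings described above (column names assumed)."""
    out = df.copy()
    # Binary encoding keeps a single interpretable 0/1 column per variable
    out["sex"] = (out["sex"] == "male").astype(int)
    out["smoker"] = (out["smoker"] == "yes").astype(int)
    # One-hot encode region so no ordinal structure is imposed
    out = pd.get_dummies(out, columns=["region"])
    # Interaction term: BMI is expected to matter far more for smokers
    out["bmi_smoker"] = out["bmi"] * out["smoker"]
    # Log-transform the target to reduce right skew
    out["log_charges"] = np.log(out["charges"])
    return out

sample = pd.DataFrame({
    "age": [30], "sex": ["male"], "bmi": [25.0], "children": [1],
    "smoker": ["yes"], "region": ["southwest"], "charges": [1000.0],
})
features = engineer_features(sample)
```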
The goal of the modelling stage is to evaluate whether the engineered features are informative in predicting individual healthcare costs, and to establish a robust baseline regression model with proper preprocessing.
The target variable is the log-transformed medical charges (log_charges), which helps reduce skewness and stabilise variance.
The dataset is split into training and testing sets using an 80/20 split:
- Training set: 80%
- Test set: 20%
- Random seed fixed for reproducibility
This ensures model performance is evaluated on unseen data.
A unified preprocessing pipeline is constructed using `ColumnTransformer` to ensure consistent feature transformations during both training and inference.
- Numerical features (`age`, `bmi`, `children`, `bmi_smoker`) are standardised using `StandardScaler`
- Remaining features (e.g. one-hot encoded categorical variables) are passed through unchanged
This approach avoids data leakage and keeps preprocessing tightly coupled with the model.
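A sketch of that `ColumnTransformer` setup, fitted on a tiny illustrative frame; the dummy column name `region_southwest` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numeric_features = ["age", "bmi", "children", "bmi_smoker"]

# Standardise the numeric columns; everything else (e.g. one-hot
# dummies) is left untouched via remainder="passthrough"
preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_features)],
    remainder="passthrough",
)

demo = pd.DataFrame({
    "age": [20, 40, 60], "bmi": [22.0, 28.0, 34.0],
    "children": [0, 1, 2], "bmi_smoker": [0.0, 28.0, 0.0],
    "region_southwest": [1, 0, 0],
})
transformed = preprocessor.fit_transform(demo)
```

Scaled columns come first in the output, followed by the passthrough columns.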
Ridge Regression is used as the baseline model due to its ability to:
- Handle multicollinearity introduced by one-hot encoding
- Stabilise coefficient estimates via L2 regularisation
- Maintain interpretability of feature coefficients
The full modelling workflow is implemented using a `Pipeline` that combines preprocessing and model fitting into a single object.
The model is trained on the training dataset using the preprocessing + Ridge regression pipeline. This ensures all transformations are learned exclusively from the training data.
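The preprocessing + Ridge pipeline can be sketched as follows, with synthetic stand-in data in place of the real training set; `alpha=1.0` is scikit-learn's default, not a tuned value.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the engineered training data
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(28, 5, n),
    "children": rng.integers(0, 4, n),
    "bmi_smoker": rng.normal(0, 10, n),
    "region_southwest": rng.integers(0, 2, n),
})
y = 0.02 * X["age"] + 0.03 * X["bmi"] + rng.normal(0, 0.1, n)

numeric = ["age", "bmi", "children", "bmi_smoker"]
model = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        [("num", StandardScaler(), numeric)], remainder="passthrough")),
    ("ridge", Ridge(alpha=1.0)),
])
# Fitting the pipeline learns scaling parameters and coefficients
# exclusively from the training data
model.fit(X, y)
```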
Model performance is evaluated on the test set using standard regression metrics such as:
- R² (coefficient of determination)
- Root Mean Squared Error (RMSE)
These metrics assess both explanatory power and predictive accuracy.
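Both metrics are available through scikit-learn; the helper below is a sketch that takes RMSE as the square root of MSE so it works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return (R², RMSE) for predictions on the log-charges scale."""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return r2, rmse
```

For a perfect prediction, `evaluate` returns R² = 1.0 and RMSE = 0.0.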
The Ridge model coefficients are examined to understand the direction and relative importance of engineered features, providing insights into key drivers of healthcare costs.
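One way to inspect coefficients is to pair them with their feature names and rank by absolute magnitude. The toy fit below stands in for the Ridge step of the fitted pipeline; the column names mirror the engineered features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Toy fit so the inspection step runs standalone; in the project the
# coefficients would come from the fitted pipeline's Ridge step.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=["age", "bmi", "bmi_smoker"])
y = 2.0 * X["bmi_smoker"] + 0.5 * X["age"] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)

# Pair coefficients with feature names, then rank by absolute effect size;
# the sign gives the direction of each feature's association with cost
coefs = pd.Series(ridge.coef_, index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```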