Telco Customer Churn Prediction

Predict whether a customer will churn (leave) using the Telco Customer Churn dataset from Kaggle.

✅ Results Summary

Dataset: Telco Customer Churn (7,043 customers)
Task: Binary classification (Churn / No Churn)
Baseline Model: Logistic Regression
Improved Model: XGBoost
Best ROC-AUC: 0.8478 (XGBoost)
Best F1-score: 0.6040 (Logistic Regression)

The project implements a complete ML pipeline including preprocessing, model comparison, evaluation, and error analysis.

📁 Project Structure

telco-churn-ml/
├── data/
│   └── telco.csv                  # Raw dataset (7,043 customers)
├── notebooks/
│   └── eda.ipynb                  # Exploratory Data Analysis
├── src/
│   ├── preprocess.py              # Data cleaning & feature pipeline
│   ├── train.py                   # Model training (baseline + improved)
│   └── evaluate.py                # Evaluation & error analysis
├── models/
│   ├── logistic_regression.pkl    # Baseline model
│   └── xgboost_model.pkl          # Improved model
├── results/
│   ├── metrics.json               # All evaluation metrics
│   ├── confusion_matrix_*.png     # Confusion matrices
│   ├── roc_curve_comparison.png   # ROC curve comparison
│   └── feature_importance.png     # XGBoost feature importances
├── requirements.txt
└── README.md

Each script can be run on its own from the command line: train.py writes the fitted models to models/, and evaluate.py writes metrics and plots to results/.

📊 Dataset

Property	Value
Source	Kaggle Telco Customer Churn
Rows	7,043
Features	20 (demographics, services, account info, charges)
Target	`Churn` (Yes / No)
Class Balance	~26.5% churn (imbalanced)

🔧 Data Preprocessing

Dropped customerID (not a feature)
Converted TotalCharges to numeric (11 whitespace entries → filled with median)
Encoded target: Churn → 0/1
Feature pipeline using ColumnTransformer:
- Numeric (tenure, MonthlyCharges, TotalCharges) → StandardScaler
- Categorical (all others) → OneHotEncoder(handle_unknown="ignore")
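
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/preprocess.py: the helper names `clean` and `build_preprocessor` are hypothetical, the tiny frame is made-up data standing in for data/telco.csv, and the median is computed over the whole frame for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["tenure", "MonthlyCharges", "TotalCharges"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above (hypothetical helper)."""
    df = df.drop(columns=["customerID"])
    # Whitespace-only strings become NaN, then receive the column median.
    total = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = total.fillna(total.median())
    # Encode the target as 0/1.
    df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})
    return df


def build_preprocessor(categorical: list) -> ColumnTransformer:
    """Scale numeric columns and one-hot encode the categorical ones."""
    return ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])


# Tiny synthetic frame (values are made up for illustration).
raw = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "tenure": [1, 24, 60, 0],
    "MonthlyCharges": [70.0, 55.5, 20.0, 80.0],
    "TotalCharges": ["70.0", "1332.0", "1200.0", " "],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "No"],
})

df = clean(raw)
X = df.drop(columns=["Churn"])
y = df["Churn"]
categorical = [c for c in X.columns if c not in NUMERIC]
X_t = build_preprocessor(categorical).fit_transform(X)
print(X_t.shape)  # 3 scaled numeric columns + 3 one-hot Contract columns
```

With handle_unknown="ignore", a category that appears only at prediction time is encoded as all zeros instead of raising an error.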

🔀 Train / Test Split

Parameter	Value
Method	`train_test_split` (scikit-learn)
Split Ratio	80% train / 20% test
Stratification	Yes (`stratify=y`) — essential for imbalanced target
Random State	42
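
As a rough sketch of what these settings do, the snippet below splits synthetic labels drawn at about the reported churn rate. The label vector and the placeholder feature matrix are assumptions for illustration, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the 26.5% churn rate reported above.
rng = np.random.default_rng(42)
y = (rng.random(7043) < 0.265).astype(int)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratification keeps the churn rate nearly identical in both partitions.
print(len(y_train), len(y_test))
print(round(y_train.mean(), 4), round(y_test.mean(), 4))
```

Without `stratify=y`, the churn share in the test set would vary from one random split to another, which makes recall and F1 noisier on a minority class of this size.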

🤖 Models Used

Baseline — Logistic Regression

Why? Interpretable, fast, standard baseline for tabular classification.
from sklearn.linear_model import LogisticRegression
LogisticRegression(max_iter=1000, random_state=42)

Improved — XGBoost Classifier

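Why? Gradient-boosted trees are a strong default for tabular data and capture non-linear feature interactions that a linear model cannot. In this project the gain shows up in ROC-AUC rather than in minority-class recall at the default threshold.
Tuned with regularisation (reg_alpha, reg_lambda, gamma, min_child_weight) to limit overfitting.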

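from xgboost import XGBClassifier

XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=1.0, reg_lambda=5.0,  # L1 / L2 penalties on leaf weights
    min_child_weight=5, gamma=0.3,  # constraints on when a node may split
    eval_metric="logloss", random_state=42
)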

📈 Metrics Reported

✅ Best Result

Property	Value
Best Model	XGBoost (by ROC-AUC)
ROC-AUC	0.8478
Accuracy	0.8020

Note: Logistic Regression achieves a slightly higher F1-score (0.6040 vs 0.5830) because its recall is higher (0.5588 vs 0.5214), so it remains competitive. XGBoost leads on ROC-AUC, indicating better overall discriminative ability across all thresholds.

ROC-AUC was chosen as the primary metric because churn prediction involves imbalanced classes, and AUC better reflects model performance across different classification thresholds.
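
To make the threshold point concrete, here is a small sketch on toy values (the labels and scores are illustrative, not model output). ROC-AUC is computed from the ranking of the scores alone, while F1 changes as soon as the decision threshold moves.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and predicted churn probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC-AUC depends only on how the scores rank positives against negatives.
auc = roc_auc_score(y_true, y_score)


def f1_at(threshold):
    """F1 after converting scores to hard labels at the given threshold."""
    y_pred = [int(s >= threshold) for s in y_score]
    return f1_score(y_true, y_pred)


print(f"ROC-AUC: {auc:.2f}")
for t in (0.5, 0.3):
    print(f"F1 at threshold {t}: {f1_at(t):.2f}")
```

This is why two models can swap places between the ROC-AUC row and the F1 row of a comparison table: F1 is reported at a single operating point.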

Full Comparison

Metric	Logistic Regression	XGBoost
Accuracy	0.8055	0.8020
Precision	0.6572	0.6610
Recall	0.5588	0.5214
F1-score	0.6040	0.5830
ROC-AUC	0.8419	0.8478

🔍 Error Analysis

Confusion Matrix Insights

Logistic Regression:

209 / 374 churned customers correctly identified (True Positives)
165 churned customers missed (False Negatives — business loss: these customers leave undetected)
109 non-churn customers flagged incorrectly (False Positives — unnecessary retention cost)

XGBoost:

195 / 374 churned customers correctly identified
179 churned customers missed (slightly more FN than LR)
100 non-churn flagged incorrectly (fewer false alarms)
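
The reported metrics follow directly from these counts. The sketch below recomputes them for XGBoost; the function name is hypothetical, and the true-negative count is not stated in this README, so it is inferred under the assumption of a 1,409-row test set (20% of 7,043).

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive the headline metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# XGBoost counts from the list above. TN is an inferred value:
# 1409 - 195 (TP) - 179 (FN) - 100 (FP) = 935.
xgb = metrics_from_counts(tp=195, fp=100, fn=179, tn=935)
for name, value in xgb.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the results match the XGBoost column of the Full Comparison table.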

Where Models Struggle

Medium-tenure customers (12–36 months) are hardest to classify — they fall between the clear short-tenure churners and loyal long-tenure customers.
Month-to-month contracts are frequently misclassified — high variability within this group.
High monthly charges increase churn probability, but some high-charge customers on long contracts remain loyal, confusing the models.

ROC Curve

Both models show strong discrimination (AUC > 0.84). XGBoost's curve is slightly higher in the low-FPR region, meaning it's better at identifying true churners when keeping false alarms low.

Feature Importance (XGBoost)

Top churn predictors:

tenure — strongest predictor; short tenure = high churn risk
MonthlyCharges — higher charges correlate with churn
TotalCharges — proxy for customer value
Contract (Month-to-month) — highest churn contract type
InternetService (Fiber optic) — fiber customers churn more than DSL

In real-world deployment, minimizing false negatives is critical since missed churners represent direct revenue loss. Threshold tuning or cost-sensitive learning could further improve recall.
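
One way threshold tuning could look is sketched below. This is not part of the project's code: the function name, the candidate grid, and the cost weights (a missed churner assumed to cost five times a false alarm) are all illustrative assumptions, and the probabilities are toy values.

```python
import numpy as np


def best_threshold(y_true, y_prob, fn_cost=5.0, fp_cost=1.0):
    """Return the candidate threshold with the lowest total error cost."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    costs = []
    for t in thresholds:
        y_pred = y_prob >= t
        fn = np.sum((y_true == 1) & ~y_pred)  # missed churners
        fp = np.sum((y_true == 0) & y_pred)   # unnecessary retention offers
        costs.append(fn_cost * fn + fp_cost * fp)
    return float(thresholds[int(np.argmin(costs))])


# Toy labels and predicted churn probabilities (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.92, 0.47, 0.32, 0.37, 0.22, 0.17, 0.12, 0.07]

t = best_threshold(y_true, y_prob)
recall = np.mean(np.asarray(y_prob)[np.asarray(y_true) == 1] >= t)
print(round(t, 2), recall)
```

In practice the threshold should be chosen on a validation split rather than the test set, so the reported test metrics stay unbiased.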

💡 Key Insights

Insight	Detail
📅 Contract type matters most	Month-to-month contracts have ~42% churn vs <5% for 2-year contracts
⏱️ Short tenure = high risk	Customers in the first 12 months are most likely to churn
💰 Higher charges → higher churn	Median charges for churned customers are significantly higher
🛡️ Protective services help	Customers without Online Security, Tech Support, or Online Backup churn more
🌐 Fiber optic paradox	Despite being a premium service, fiber optic users churn more — possibly due to higher costs
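
Group-level churn rates like the contract figures above can be computed with a short pandas aggregation. The rows below are made up for illustration; the real analysis would run on data/telco.csv, where `Contract` and `Churn` are existing columns.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 5 + ["One year"] * 3 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No"],
})

# Share of churned customers per contract type, highest first.
churn_rate = (
    df.assign(churned=df["Churn"].eq("Yes"))
      .groupby("Contract")["churned"]
      .mean()
      .sort_values(ascending=False)
)
print(churn_rate)
```

Swapping `"Contract"` for another categorical column gives the same breakdown for that attribute.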

🚀 How To Run

# 1. Install dependencies
pip install -r requirements.txt

# 2. Train models
python src/train.py

# 3. Evaluate & generate plots
python src/evaluate.py

# 4. Explore EDA notebook
jupyter notebook notebooks/eda.ipynb

All results reported in this README were generated using random_state=42 for reproducibility.

📦 Requirements

Python 3.8+
pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, joblib, jupyter
