hemish22/ICSSR-Track-1

Telco Customer Churn Prediction

Predict whether a customer will churn (leave) using the Telco Customer Churn dataset from Kaggle.

✅ Results Summary

  • Dataset: Telco Customer Churn (7,043 customers)
  • Task: Binary classification (Churn / No Churn)
  • Baseline Model: Logistic Regression
  • Improved Model: XGBoost
  • Best ROC-AUC: 0.8478 (XGBoost)
  • Best F1-score: 0.6040 (Logistic Regression)

The project implements a complete ML pipeline including preprocessing, model comparison, evaluation, and error analysis.


📁 Project Structure

telco-churn-ml/
├── data/
│   └── telco.csv                  # Raw dataset (7,043 customers)
├── notebooks/
│   └── eda.ipynb                  # Exploratory Data Analysis
├── src/
│   ├── preprocess.py              # Data cleaning & feature pipeline
│   ├── train.py                   # Model training (baseline + improved)
│   └── evaluate.py                # Evaluation & error analysis
├── models/
│   ├── logistic_regression.pkl    # Baseline model
│   └── xgboost_model.pkl          # Improved model
├── results/
│   ├── metrics.json               # All evaluation metrics
│   ├── confusion_matrix_*.png     # Confusion matrices
│   ├── roc_curve_comparison.png   # ROC curve comparison
│   └── feature_importance.png     # XGBoost feature importances
├── requirements.txt
└── README.md

All scripts are executable independently and generate models and results automatically.


📊 Dataset

| Property | Value |
| --- | --- |
| Source | Kaggle Telco Customer Churn |
| Rows | 7,043 |
| Features | 20 (demographics, services, account info, charges) |
| Target | Churn (Yes / No) |
| Class Balance | ~26.5% churn (imbalanced) |

🔧 Data Preprocessing

  1. Dropped customerID (not a feature)
  2. Converted TotalCharges to numeric (11 whitespace entries → filled with median)
  3. Encoded target: Churn → 0/1
  4. Feature pipeline using ColumnTransformer:
    • Numeric (tenure, MonthlyCharges, TotalCharges) → StandardScaler
    • Categorical (all others) → OneHotEncoder(handle_unknown="ignore")
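The steps above can be sketched as follows. This is a minimal illustration, not the contents of src/preprocess.py: the helper names `clean` and `build_preprocessor` and the toy DataFrame are assumptions made for the example, and only a single categorical column is shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_COLS = ["tenure", "MonthlyCharges", "TotalCharges"]


def clean(df):
    """Apply steps 1-3: drop the ID, fix TotalCharges, encode the target."""
    df = df.drop(columns=["customerID"])
    # Whitespace-only entries cannot be parsed, so they become NaN ...
    total = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # ... and are then filled with the median of the valid values.
    df["TotalCharges"] = total.fillna(total.median())
    y = (df["Churn"] == "Yes").astype(int)
    X = df.drop(columns=["Churn"])
    return X, y


def build_preprocessor(X):
    """Step 4: scale numeric columns, one-hot encode everything else."""
    categorical_cols = [c for c in X.columns if c not in NUMERIC_COLS]
    return ColumnTransformer([
        ("num", StandardScaler(), NUMERIC_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])


# Hypothetical four-row sample standing in for data/telco.csv.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "tenure": [1, 24, 60, 0],
    "MonthlyCharges": [70.0, 55.5, 20.0, 80.0],
    "TotalCharges": ["70.0", "1332.0", "1200.0", " "],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "No"],
})

X, y = clean(toy)
features = build_preprocessor(X).fit_transform(X)
print(features.shape)  # 3 scaled numeric columns + 3 one-hot Contract columns
```

Setting handle_unknown="ignore" means a category that appears only at prediction time is encoded as all zeros instead of raising an error.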

🔀 Train / Test Split

| Parameter | Value |
| --- | --- |
| Method | train_test_split (scikit-learn) |
| Split Ratio | 80% train / 20% test |
| Stratification | Yes (stratify=y), essential for the imbalanced target |
| Random State | 42 |
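A small sketch of what stratification buys, using the same train_test_split settings on toy labels. The label array and placeholder features are assumptions for illustration; the point is that both splits keep the original churn rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the class balance quoted above (26.5% positives).
y = np.array([1] * 265 + [0] * 735)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y, the positive rate is preserved in both partitions.
print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Without stratify=y the test-set churn rate would vary from split to split, which makes recall and F1 harder to compare between runs.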

🤖 Models Used

Baseline — Logistic Regression

  • Why? Interpretable, fast, standard baseline for tabular classification.
  • LogisticRegression(max_iter=1000, random_state=42)

Improved — XGBoost Classifier

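  • Why? Gradient-boosted trees handle tabular data well and capture non-linear feature interactions. They often improve minority-class recall, although in this run XGBoost's recall (0.5214) was lower than Logistic Regression's (0.5588).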
  • Tuned with regularisation (reg_alpha, reg_lambda, gamma, min_child_weight) to prevent overfitting.
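```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,   # row and column subsampling per tree
    reg_alpha=1.0, reg_lambda=5.0,         # L1 and L2 regularisation
    min_child_weight=5, gamma=0.3,         # conservative split criteria
    eval_metric="logloss", random_state=42
)
```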

📈 Metrics Reported

✅ Best Result

| Best Model | XGBoost (by ROC-AUC) |
| --- | --- |
| ROC-AUC | 0.8478 |
| Accuracy | 0.8020 |

Note: Logistic Regression achieves a slightly higher F1-score (0.604 vs 0.583) because its recall is higher (0.5588 vs 0.5214), so it remains competitive. XGBoost leads on ROC-AUC (0.8478 vs 0.8419), indicating slightly better ranking of churners across all thresholds.

ROC-AUC was chosen as the primary metric because the classes are imbalanced: accuracy can look good while churners are missed, whereas AUC summarises performance across all classification thresholds rather than at a single cut-off.
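One way to see why AUC does not depend on a threshold: it equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below computes it that way on made-up scores; the function name `roc_auc` and the sample values are illustrative only, and the project itself presumably relies on scikit-learn for this metric.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predicted churn probabilities for five customers.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 4))  # 5 of the 6 pairs are ordered correctly
```

Because only the ordering of the scores matters, any threshold choice leaves this number unchanged.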

Full Comparison

| Metric | Logistic Regression | XGBoost |
| --- | --- | --- |
| Accuracy | 0.8055 | 0.8020 |
| Precision | 0.6572 | 0.6610 |
| Recall | 0.5588 | 0.5214 |
| F1-score | 0.6040 | 0.5830 |
| ROC-AUC | 0.8419 | 0.8478 |

🔍 Error Analysis

Confusion Matrix Insights

Logistic Regression:

  • 209 / 374 churned customers correctly identified (True Positives)
  • 165 churned customers missed (False Negatives; a business loss, since these customers leave undetected)
  • 109 non-churn customers flagged incorrectly (False Positives; unnecessary retention cost)

XGBoost:

  • 195 / 374 churned customers correctly identified
  • 179 churned customers missed (slightly more FN than LR)
  • 100 non-churn flagged incorrectly (fewer false alarms)
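The confusion-matrix counts tie back to the reported metrics by simple arithmetic. The sketch below recomputes the XGBoost row from the counts above. The test-set size of 1,409 is an assumption (20% of 7,043 rows, rounded up) used only to derive the true-negative count for accuracy.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


TEST_SIZE = 1409  # assumption: 20% of 7,043 rows, rounded up

# XGBoost counts from the list above.
tp, fn, fp = 195, 179, 100
tn = TEST_SIZE - tp - fn - fp

metrics = classification_metrics(tp, fn, fp, tn)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under that assumption the rounded values agree with the XGBoost column of the Full Comparison table (precision 0.6610, recall 0.5214, F1 0.5830, accuracy 0.8020).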

Where Models Struggle

  • Medium-tenure customers (12–36 months) are hardest to classify — they fall between the clear short-tenure churners and loyal long-tenure customers.
  • Month-to-month contracts are frequently misclassified — high variability within this group.
  • High monthly charges increase churn probability, but some high-charge customers on long contracts remain loyal, confusing the models.

ROC Curve

Both models show strong discrimination (AUC > 0.84). XGBoost's curve is slightly higher in the low-FPR region, meaning it's better at identifying true churners when keeping false alarms low.

Feature Importance (XGBoost)

Top churn predictors:

  1. tenure — strongest predictor; short tenure = high churn risk
  2. MonthlyCharges — higher charges correlate with churn
  3. TotalCharges — proxy for customer value
  4. Contract (Month-to-month) — highest churn contract type
  5. InternetService (Fiber optic) — fiber customers churn more than DSL

In real-world deployment, minimizing false negatives is critical since missed churners represent direct revenue loss. Threshold tuning or cost-sensitive learning could further improve recall.
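Threshold tuning can be sketched as a sweep that picks the cut-off with the lowest total error cost. Everything here is hypothetical: the helper names, the toy probabilities, and in particular the 5:1 cost ratio between a missed churner and a false alarm, which a real deployment would have to estimate from business data.

```python
def expected_cost(labels, probs, threshold, fn_cost, fp_cost):
    """Total cost of the errors made when flagging probs >= threshold as churn."""
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    return fn * fn_cost + fp * fp_cost


def best_threshold(labels, probs, fn_cost=5.0, fp_cost=1.0):
    """Sweep thresholds 0.05, 0.10, ..., 0.95 and return the cheapest one."""
    candidates = [i / 100 for i in range(5, 100, 5)]
    return min(
        candidates,
        key=lambda t: expected_cost(labels, probs, t, fn_cost, fp_cost),
    )


# Made-up validation labels and predicted churn probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.80, 0.45, 0.30, 0.55, 0.35, 0.20, 0.10, 0.05]

t = best_threshold(labels, probs)
print(t)
print(expected_cost(labels, probs, t, 5.0, 1.0))
print(expected_cost(labels, probs, 0.5, 5.0, 1.0))
```

On this toy data the chosen threshold falls below the default 0.5: it accepts an extra false alarm in exchange for catching every churner. The threshold should be selected on a validation split, not on the test set used for the reported metrics.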


💡 Key Insights

| Insight | Detail |
| --- | --- |
| 📅 Contract type matters most | Month-to-month contracts have ~42% churn vs <5% for 2-year contracts |
| ⏱️ Short tenure = high risk | Customers in the first 12 months are most likely to churn |
| 💰 Higher charges → higher churn | Median charges for churned customers are significantly higher |
| 🛡️ Protective services help | Customers without Online Security, Tech Support, or Online Backup churn more |
| 🌐 Fiber optic paradox | Despite being a premium service, fiber optic users churn more, possibly due to higher costs |
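Group-level churn rates like the ones in this table can be computed with a pandas groupby. The sketch below uses a made-up eight-row frame, so its output does not reproduce the figures above. On the real data the same expression would be applied to the Contract and Churn columns of data/telco.csv.

```python
import pandas as pd

# Made-up sample; the real dataset has 7,043 rows.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "No"],
})

# The mean of a boolean series is the share of True values,
# so this gives the churn rate within each contract type.
churn_rate = (toy["Churn"] == "Yes").groupby(toy["Contract"]).mean()
print(churn_rate)
```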

🚀 How To Run

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Train models
python src/train.py

# 3. Evaluate & generate plots
python src/evaluate.py

# 4. Explore EDA notebook
jupyter notebook notebooks/eda.ipynb
```

All results reported in this README were generated using random_state=42 for reproducibility.


📦 Requirements

  • Python 3.8+
  • pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, joblib, jupyter
