Skip to content

Analyzed e-commerce shopper data using R to explore behavior patterns, clean data, and identify key factors influencing purchase intent.

Notifications You must be signed in to change notification settings

sourabhgithubcode/customer-segmentation-clustering

Repository files navigation

From Clicks to Conversions: 4-Cluster Segmentation of 12K+ Sessions Achieving 99% Precision in Purchase Prediction

Why It Mattered → Business Context

E-commerce companies lose 84% of potential customers to cart abandonment and poor engagement. Understanding which visitors will convert before they bounce is critical for:

  • Optimizing ad spend by retargeting high-intent users
  • Reducing bounce rates through personalized UX interventions
  • Increasing ROI on marketing campaigns by focusing on users most likely to purchase

This project analyzed 12,330 real-world shopping sessions to predict purchase intent and segment users by behavior—enabling data-driven strategies to boost conversion rates and reduce wasted marketing dollars.


What I Did → Tools + My Role

Role: Data Analyst & ML Engineer Objective: Build a predictive model to classify purchase intent and identify high-value customer segments

Technical Stack:

  • Languages: R (tidyverse, dplyr, ggplot2)
  • Machine Learning: SVM (e1071), Decision Trees (rpart), Random Forest, K-Means Clustering
  • Evaluation: caret, pROC (ROC curves), confusion matrices
  • Visualization: ggplot2, factoextra, PCA projections

Deliverables:

  • Cleaned & preprocessed dataset with 18+ engineered features
  • K-Means clustering model (4 distinct customer segments)
  • SVM classifier achieving 98.99% accuracy and 99.9% AUC
  • Comprehensive analysis report with actionable business recommendations

How I Did It → Focus on Thinking + Key Decisions

  1. Strategic Data Cleaning (Not Just Mechanical)

Challenge: Highly skewed distributions (BounceRates, ExitRates) and class imbalance (84% non-purchasers).

Decision:

  • Applied log transformations (log1p) instead of standard scaling to preserve relationships in skewed data
  • Used IQR capping (not deletion) for outliers to retain sample size
  • Manually oversampled minority class to balance training data—avoiding synthetic data generation that could introduce noise

Why it mattered: Preserved 100% of data integrity while ensuring models could learn from rare purchase events.


  1. Feature Engineering Based on Domain Knowledge

Challenge: Raw features didn't capture engagement depth.

Decision:

  • Created ProductEngagement bins (Low/Medium/High) from continuous duration data
  • Applied rolling mean smoothing to SpecialDay variable to reduce noise from event spikes
  • Dropped SpecialDay after correlation analysis showed weak predictive power (-0.08 with Revenue)

Critical thinking: Instead of using all features blindly, I tested statistical significance (t-tests, correlation) and removed low-impact variables—improving model interpretability without sacrificing accuracy.


  1. Chose Clustering BEFORE Classification (Unsupervised → Supervised Pipeline)

Challenge: No clear understanding of user archetypes in the data.

Decision:

  • Used Elbow Method + Silhouette Scores to objectively determine k=4 clusters
  • Validated clusters by cross-referencing with Revenue labels—discovered Cluster 3 had 30.5% conversion vs. 0.65% in Cluster 1
  • Used cluster insights to inform feature selection for classification

Why this approach: Clustering revealed hidden patterns (e.g., seasonal buyers, browsers) that became critical features for the SVM model—bridging unsupervised and supervised learning.


  1. Model Selection: Chose SVM Over Decision Trees (And Proved It)

Challenge: Decision Tree showed 92.8% accuracy but struggled with Cluster 3 (62% sensitivity).

Decision:

  • Hyperparameter tuned both models using 5-fold cross-validation
  • SVM (RBF kernel, C=1.0) achieved 98.99% accuracy vs. Decision Tree's 95.82%
  • Manually calculated Precision (99.49%) and Recall (99.74%) to confirm balanced performance

Critical thinking: Didn't stop at accuracy—analyzed confusion matrix, ROC curve (AUC=0.999), and class-specific performance to ensure the model wasn't just memorizing the majority class.


  1. Validated Against Overfitting (Feature Leakage Check)

Challenge: Near-perfect accuracy raised red flags about data leakage.

Decision:

  • Removed MonthMay and SpecialDay_Smoothed (high correlation: -0.76, -0.62) to test model dependency
  • Retrained on reduced feature set—accuracy dropped only 0.03% (from 99.95% to 99.92%)
  • Conducted train-test split (80-20) and cross-validation to confirm generalization

Why it mattered: Proved the model wasn't overfitting to obvious features—it genuinely learned behavioral patterns like BounceRates, ExitRates, and PageValues.


What Happened → Clear Business Outcome

Model Performance:

  • SVM Accuracy: 98.99% (optimized)
  • Precision: 99.49% (minimal false positives)
  • Recall: 99.74% (catches almost all potential buyers)
  • AUC: 0.9990 (near-perfect class separation)

Business Insights:

  1. Identified High-Value Segment (Cluster 3): - 30.5% conversion rate vs. site average of ~16% - Characterized by: Low bounce rates, high PageValues, longer product browsing time - Actionable: Target this segment with premium offers and retargeting ads
  2. Low-Engagement Warning (Cluster 1): - 0.65% conversion rate (1,081 users) - High bounce rates, minimal time on product pages - Actionable: Exclude from expensive ad campaigns; A/B test landing page improvements
  3. Seasonal Revenue Spikes: - November and March show highest purchase intent - Actionable: Increase ad spend 4-6 weeks before these months; prepare inventory
  4. Key Predictors Ranked: - PageValues (0.49 correlation) → Optimize high-value pages - BounceRates/ExitRates (negative correlation) → Improve page load speed and UX - ProductRelated_Duration → Longer browsing = higher intent

Estimated Business Impact:

If applied to a 100K monthly visitor site:

  • Targeting Cluster 3 (30.5% conversion) could yield ~30,500 conversions/month
  • Excluding Cluster 1 from paid ads saves ~9% of ad budget
  • Optimizing PageValues pages could lift overall conversion by 5-10% (based on correlation strength)

What I Learned → Show Growth Mindset

Technical Growth:

  1. Accuracy isn't everything: Learned to evaluate models using Precision, Recall, F1-Score, AUC, and confusion matrices—not just accuracy. This prevented me from deploying an overfitted model.
  2. Feature engineering > more data: Instead of seeking more records, I created meaningful features (ProductEngagement bins, rolling averages) that boosted model performance more than raw volume.
  3. Unsupervised learning unlocks supervised insights: Clustering revealed customer archetypes I wouldn't have found through classification alone—now I always explore data structure before jumping to prediction.

Strategic Thinking:

  1. Always validate against business logic: When SVM achieved 99.95% accuracy, I didn't celebrate—I tested for data leakage by removing high-correlation features. This discipline prevents real-world model failures.
  2. Interpretability matters for stakeholders: Decision Trees are easier to explain, but SVM performed better. I learned to balance model complexity with business communication—providing feature importance rankings to translate "black box" models into actionable insights.

Next Steps:

  1. I want to improve this project by: - Implementing real-time prediction using a Flask API - Testing ensemble methods (XGBoost, Gradient Boosting) to see if I can push past 99% accuracy - Building an interactive dashboard (Shiny or Tableau) for stakeholders to explore customer segments dynamically

Bottom line: This project taught me that data science isn't about running algorithms—it's about asking the right questions, validating assumptions, and translating findings into business value. I now approach every project by asking: "So what? How does this change our decision?"


🛠️ Tools Used

  • R
  • tidyverse
  • ggplot2
  • dplyr

📊 Meaning customer Intentions clusters

Screenshot 2026-01-01 at 8 18 24 PM

📦 Installation & Run

# In RStudio or R console
source("OSI_analysis.R")

📄 License

MIT

About

Analyzed e-commerce shopper data using R to explore behavior patterns, clean data, and identify key factors influencing purchase intent.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages