From Clicks to Conversions: 4-Cluster Segmentation of 12K+ Sessions Achieving 99% Precision in Purchase Prediction
Why It Mattered → Business Context
E-commerce sites lose most potential customers to cart abandonment and shallow engagement; in this dataset, 84% of sessions ended without a purchase. Understanding which visitors will convert before they bounce is critical for:
- Optimizing ad spend by retargeting high-intent users
- Reducing bounce rates through personalized UX interventions
- Increasing ROI on marketing campaigns by focusing on users most likely to purchase
This project analyzed 12,330 real-world shopping sessions to predict purchase intent and segment users by behavior—enabling data-driven strategies to boost conversion rates and reduce wasted marketing dollars.
What I Did → Tools + My Role
Role: Data Analyst & ML Engineer
Objective: Build a predictive model to classify purchase intent and identify high-value customer segments
Technical Stack:
- Languages: R (tidyverse, dplyr, ggplot2)
- Machine Learning: SVM (e1071), Decision Trees (rpart), Random Forest, K-Means Clustering
- Evaluation: caret, pROC (ROC curves), confusion matrices
- Visualization: ggplot2, factoextra, PCA projections
Deliverables:
- Cleaned & preprocessed dataset with 18+ engineered features
- K-Means clustering model (4 distinct customer segments)
- SVM classifier achieving 98.99% accuracy and 0.999 AUC
- Comprehensive analysis report with actionable business recommendations
How I Did It → Focus on Thinking + Key Decisions
- Strategic Data Cleaning (Not Just Mechanical)
Challenge: Highly skewed distributions (BounceRates, ExitRates) and class imbalance (84% non-purchasers).
Decision:
- Applied log transformations (log1p) instead of standard scaling to preserve relationships in skewed data
- Used IQR capping (not deletion) for outliers to retain sample size
- Manually oversampled minority class to balance training data—avoiding synthetic data generation that could introduce noise
Why it mattered: retained every session in the sample while ensuring the models could learn from rare purchase events (sketched below).
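A minimal sketch of these steps, assuming the sessions sit in a data frame `osi` with the standard Online Shoppers Intention columns, `Revenue` recoded as a Yes/No factor, and a training split `train` already made (the names are illustrative; the exact code lives in OSI_analysis.R):

```r
library(dplyr)

# log1p compresses the long right tails while keeping zeros meaningful
osi <- osi %>%
  mutate(across(c(BounceRates, ExitRates, PageValues), log1p))

# Cap extreme values at the 1.5 * IQR fences instead of dropping rows
iqr_cap <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}
osi <- osi %>% mutate(across(where(is.numeric), iqr_cap))

# Manual oversampling: repeat minority-class rows with replacement
# until the classes balance, instead of generating synthetic data
minority  <- filter(train, Revenue == "Yes")
majority  <- filter(train, Revenue == "No")
train_bal <- bind_rows(
  majority,
  minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
)
```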
- Feature Engineering Based on Domain Knowledge
Challenge: Raw features didn't capture engagement depth.
Decision:
- Created ProductEngagement bins (Low/Medium/High) from continuous duration data
- Applied rolling mean smoothing to SpecialDay variable to reduce noise from event spikes
- Dropped SpecialDay after correlation analysis showed weak predictive power (-0.08 with Revenue)
Critical thinking: instead of using every feature blindly, I tested statistical significance (t-tests, correlation) and removed low-impact variables, improving model interpretability without sacrificing accuracy (see the sketch below).
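A sketch of the engineered features; the tercile split and the window width `k = 7` are illustrative choices, not necessarily those in the final script:

```r
library(dplyr)
library(zoo)

# Bin continuous browsing time into Low/Medium/High engagement terciles
osi$ProductEngagement <- factor(
  ntile(osi$ProductRelated_Duration, 3),
  labels = c("Low", "Medium", "High")
)

# Rolling mean damps one-off spikes around holiday events
# (assumes rows are in session order)
osi$SpecialDay_Smoothed <- rollmean(osi$SpecialDay, k = 7, fill = NA)

# Test a feature before keeping it: weak correlation -> drop
cor(osi$SpecialDay, as.numeric(osi$Revenue == "Yes"))  # ~ -0.08 here
t.test(PageValues ~ Revenue, data = osi)               # significance check
```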
- Chose Clustering BEFORE Classification (Unsupervised → Supervised Pipeline)
Challenge: No clear understanding of user archetypes in the data.
Decision:
- Used Elbow Method + Silhouette Scores to objectively determine k=4 clusters
- Validated clusters by cross-referencing with Revenue labels—discovered Cluster 3 had 30.5% conversion vs. 0.65% in Cluster 1
- Used cluster insights to inform feature selection for classification
Why this approach: clustering revealed hidden patterns (e.g., seasonal buyers vs. casual browsers) that became critical features for the SVM model, bridging unsupervised and supervised learning (see the sketch below).
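Roughly how the k selection and validation looked; `fviz_nbclust` is from factoextra, and the seed and `nstart` values are illustrative:

```r
library(dplyr)
library(factoextra)

feats <- scale(select(osi, where(is.numeric)))

# Objective choice of k: elbow (within-cluster SS) plus average silhouette
fviz_nbclust(feats, kmeans, method = "wss")
fviz_nbclust(feats, kmeans, method = "silhouette")

set.seed(42)
km <- kmeans(feats, centers = 4, nstart = 25)
osi$Cluster <- factor(km$cluster)

# Cross-reference clusters with the Revenue label they never saw
osi %>%
  group_by(Cluster) %>%
  summarise(conversion_rate = mean(Revenue == "Yes"), sessions = n())
```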
- Model Selection: Chose SVM Over Decision Trees (And Proved It)
Challenge: Decision Tree showed 92.8% accuracy but struggled with Cluster 3 (62% sensitivity).
Decision:
- Hyperparameter-tuned both models using 5-fold cross-validation
- SVM (RBF kernel, C=1.0) achieved 98.99% accuracy vs. the tuned Decision Tree's 95.82% (up from its 92.8% baseline)
- Manually calculated Precision (99.49%) and Recall (99.74%) to confirm balanced performance
Critical thinking: I didn't stop at accuracy. I analyzed the confusion matrix, the ROC curve (AUC = 0.999), and class-specific performance to confirm the model wasn't just memorizing the majority class (sketched below).
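A sketch of the tuning and evaluation using e1071's built-in 5-fold cross-validation; the parameter grids are illustrative, and caret and pROC supply the metrics beyond plain accuracy:

```r
library(e1071)
library(caret)
library(pROC)

# Grid-search an RBF-kernel SVM with 5-fold cross-validation
set.seed(42)
tuned <- tune.svm(Revenue ~ ., data = train_bal, kernel = "radial",
                  cost = c(0.1, 1, 10), gamma = c(0.01, 0.1),
                  tunecontrol = tune.control(cross = 5))

# Refit the winner with probability estimates enabled for the ROC curve
best_svm <- svm(Revenue ~ ., data = train_bal, kernel = "radial",
                cost = tuned$best.parameters$cost,
                gamma = tuned$best.parameters$gamma,
                probability = TRUE)

# Look past accuracy: per-class precision and recall from the confusion matrix
pred <- predict(best_svm, test)
confusionMatrix(pred, test$Revenue, positive = "Yes", mode = "prec_recall")

# ROC / AUC from the predicted class probabilities
probs <- attr(predict(best_svm, test, probability = TRUE), "probabilities")[, "Yes"]
auc(roc(test$Revenue, probs))
```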
- Validated Against Overfitting (Feature Leakage Check)
Challenge: Near-perfect accuracy raised red flags about data leakage.
Decision:
- Removed MonthMay and SpecialDay_Smoothed (high correlation: -0.76, -0.62) to test model dependency
- Retrained on the reduced feature set; accuracy dropped only 0.03 percentage points (from 99.95% to 99.92%)
- Conducted train-test split (80-20) and cross-validation to confirm generalization
Why it mattered: this showed the model wasn't leaning on a few obvious features; it genuinely learned behavioral patterns from BounceRates, ExitRates, and PageValues (sketch below).
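The leakage check in sketch form: drop the two suspect columns, retrain with the same settings, and compare held-out accuracy (column names follow the engineered dataset above):

```r
library(dplyr)
library(e1071)

holdout_acc <- function(fit, newdata)
  mean(predict(fit, newdata) == newdata$Revenue)

# Retrain without the features most suspected of leaking the label
train_red <- select(train_bal, -MonthMay, -SpecialDay_Smoothed)
svm_red <- svm(Revenue ~ ., data = train_red, kernel = "radial",
               cost = tuned$best.parameters$cost,
               gamma = tuned$best.parameters$gamma)

holdout_acc(best_svm, test)  # full feature set
holdout_acc(svm_red, test)   # reduced set: accuracy should barely move
```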
What Happened → Clear Business Outcome
Model Performance:
- SVM Accuracy: 98.99% (optimized)
- Precision: 99.49% (minimal false positives)
- Recall: 99.74% (catches almost all potential buyers)
- AUC: 0.9990 (near-perfect class separation)
Business Insights:
- Identified High-Value Segment (Cluster 3):
  - 30.5% conversion rate vs. the site average of ~16%
  - Characterized by low bounce rates, high PageValues, and longer product browsing time
  - Actionable: target this segment with premium offers and retargeting ads
- Low-Engagement Warning (Cluster 1):
  - 0.65% conversion rate (1,081 users)
  - High bounce rates, minimal time on product pages
  - Actionable: exclude from expensive ad campaigns; A/B test landing page improvements
- Seasonal Revenue Spikes:
  - November and March show the highest purchase intent
  - Actionable: increase ad spend 4-6 weeks before these months; prepare inventory
- Key Predictors Ranked (ranking sketch below):
  - PageValues (0.49 correlation) → optimize high-value pages
  - BounceRates/ExitRates (negative correlation) → improve page load speed and UX
  - ProductRelated_Duration → longer browsing signals higher intent
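One way to produce that ranking: a simple correlation screen of every numeric feature against the binary purchase label (a point-biserial correlation), sketched here:

```r
library(dplyr)

# Rank numeric features by absolute correlation with the purchase outcome
num_feats <- select(osi, where(is.numeric))
ranking   <- cor(num_feats, as.numeric(osi$Revenue == "Yes"))
sort(abs(ranking[, 1]), decreasing = TRUE)
```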
Estimated Business Impact:
If applied to a 100K monthly visitor site:
- Targeting Cluster 3 (30.5% conversion) could yield up to ~30,500 conversions/month, assuming the whole 100K audience matched that segment's profile; in practice the lift scales with the segment's share of traffic
- Excluding Cluster 1 from paid ads saves ~9% of ad budget
- Optimizing PageValues pages could lift overall conversion by 5-10% (based on correlation strength)
What I Learned → Show Growth Mindset
Technical Growth:
- Accuracy isn't everything: Learned to evaluate models using Precision, Recall, F1-Score, AUC, and confusion matrices—not just accuracy. This prevented me from deploying an overfitted model.
- Feature engineering > more data: Instead of seeking more records, I created meaningful features (ProductEngagement bins, rolling averages) that boosted model performance more than raw volume.
- Unsupervised learning unlocks supervised insights: Clustering revealed customer archetypes I wouldn't have found through classification alone—now I always explore data structure before jumping to prediction.
Strategic Thinking:
- Always validate against business logic: When SVM achieved 99.95% accuracy, I didn't celebrate—I tested for data leakage by removing high-correlation features. This discipline prevents real-world model failures.
- Interpretability matters for stakeholders: Decision Trees are easier to explain, but SVM performed better. I learned to balance model complexity with business communication—providing feature importance rankings to translate "black box" models into actionable insights.
Next Steps:
- I want to improve this project by:
  - Implementing real-time prediction using a Flask API
  - Testing ensemble methods (XGBoost, Gradient Boosting) to see if I can push past 99% accuracy
  - Building an interactive dashboard (Shiny or Tableau) for stakeholders to explore customer segments dynamically
Bottom line: This project taught me that data science isn't about running algorithms—it's about asking the right questions, validating assumptions, and translating findings into business value. I now approach every project by asking: "So what? How does this change our decision?"
How to Run
```r
# In RStudio or R console
source("OSI_analysis.R")
```
License: MIT