From Clicks to Conversions: 4-Cluster Segmentation of 12K+ Sessions Achieving 99% Precision in Purchase Prediction
Why It Mattered → Business Context
E-commerce sites lose most potential customers to cart abandonment and shallow engagement; in this dataset, 84% of sessions ended without a purchase. Understanding which visitors will convert before they bounce is critical for:
- Optimizing ad spend by retargeting high-intent users
- Reducing bounce rates through personalized UX interventions
- Increasing ROI on marketing campaigns by focusing on users most likely to purchase
This project analyzed 12,330 real-world shopping sessions to predict purchase intent and segment users by behavior—enabling data-driven strategies to boost conversion rates and reduce wasted marketing dollars.
What I Did → Tools + My Role
Role: Data Analyst & ML Engineer
Objective: Build a predictive model to classify purchase intent and identify high-value customer segments
Technical Stack:
- Languages: R (tidyverse, dplyr, ggplot2)
- Machine Learning: SVM (e1071), Decision Trees (rpart), Random Forest, K-Means Clustering
- Evaluation: caret, pROC (ROC curves), confusion matrices
- Visualization: ggplot2, factoextra, PCA projections
Deliverables:
- Cleaned & preprocessed dataset with 18+ engineered features
- K-Means clustering model (4 distinct customer segments)
- SVM classifier achieving 98.99% accuracy and 0.999 AUC
- Comprehensive analysis report with actionable business recommendations
How I Did It → Focus on Thinking + Key Decisions
- Strategic Data Cleaning (Not Just Mechanical)
Challenge: Highly skewed distributions (BounceRates, ExitRates) and class imbalance (84% non-purchasers).
Decision:
- Applied log transformations (log1p) instead of standard scaling to preserve relationships in skewed data
- Used IQR capping (not deletion) for outliers to retain sample size
- Manually oversampled minority class to balance training data—avoiding synthetic data generation that could introduce noise
Why it mattered: retained every session in the sample while ensuring the models could learn from rare purchase events (sketched below).
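A minimal sketch of these steps, assuming the sessions sit in a data frame `osi` with the standard Online Shoppers Intention columns, `Revenue` recoded as a Yes/No factor, and a training split `train` already made (the names are illustrative; the exact code lives in OSI_analysis.R):

```r
library(dplyr)

# log1p compresses the long right tails while keeping zeros meaningful
osi <- osi %>%
  mutate(across(c(BounceRates, ExitRates, PageValues), log1p))

# Cap extreme values at the 1.5 * IQR fences instead of dropping rows
iqr_cap <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}
osi <- osi %>% mutate(across(where(is.numeric), iqr_cap))

# Manual oversampling: repeat minority-class rows with replacement
# until the classes balance, instead of generating synthetic data
minority  <- filter(train, Revenue == "Yes")
majority  <- filter(train, Revenue == "No")
train_bal <- bind_rows(
  majority,
  minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
)
```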
- Feature Engineering Based on Domain Knowledge
Challenge: Raw features didn't capture engagement depth.
Decision:
- Created ProductEngagement bins (Low/Medium/High) from continuous duration data
- Applied rolling mean smoothing to SpecialDay variable to reduce noise from event spikes
- Dropped SpecialDay after correlation analysis showed weak predictive power (-0.08 with Revenue)
Critical thinking: instead of using every feature blindly, I tested statistical significance (t-tests, correlation) and removed low-impact variables, improving model interpretability without sacrificing accuracy (see the sketch below).
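A sketch of the engineered features; the tercile split and the window width `k = 7` are illustrative choices, not necessarily those in the final script:

```r
library(dplyr)
library(zoo)

# Bin continuous browsing time into Low/Medium/High engagement terciles
osi$ProductEngagement <- factor(
  ntile(osi$ProductRelated_Duration, 3),
  labels = c("Low", "Medium", "High")
)

# Rolling mean damps one-off spikes around holiday events
# (assumes rows are in session order)
osi$SpecialDay_Smoothed <- rollmean(osi$SpecialDay, k = 7, fill = NA)

# Test a feature before keeping it: weak correlation -> drop
cor(osi$SpecialDay, as.numeric(osi$Revenue == "Yes"))  # ~ -0.08 here
t.test(PageValues ~ Revenue, data = osi)               # significance check
```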
- Chose Clustering BEFORE Classification (Unsupervised → Supervised Pipeline)
Challenge: No clear understanding of user archetypes in the data.
Decision:
- Used Elbow Method + Silhouette Scores to objectively determine k=4 clusters
- Validated clusters by cross-referencing with Revenue labels—discovered Cluster 3 had 30.5% conversion vs. 0.65% in Cluster 1
- Used cluster insights to inform feature selection for classification
Why this approach: clustering revealed hidden patterns (e.g., seasonal buyers vs. casual browsers) that became critical features for the SVM model, bridging unsupervised and supervised learning (see the sketch below).
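Roughly how the k selection and validation looked; `fviz_nbclust` is from factoextra, and the seed and `nstart` values are illustrative:

```r
library(dplyr)
library(factoextra)

feats <- scale(select(osi, where(is.numeric)))

# Objective choice of k: elbow (within-cluster SS) plus average silhouette
fviz_nbclust(feats, kmeans, method = "wss")
fviz_nbclust(feats, kmeans, method = "silhouette")

set.seed(42)
km <- kmeans(feats, centers = 4, nstart = 25)
osi$Cluster <- factor(km$cluster)

# Cross-reference clusters with the Revenue label they never saw
osi %>%
  group_by(Cluster) %>%
  summarise(conversion_rate = mean(Revenue == "Yes"), sessions = n())
```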
- Model Selection: Chose SVM Over Decision Trees (And Proved It)
Challenge: Decision Tree showed 92.8% accuracy but struggled with Cluster 3 (62% sensitivity).
Decision:
- Hyperparameter-tuned both models using 5-fold cross-validation
- SVM (RBF kernel, C=1.0) achieved 98.99% accuracy vs. the tuned Decision Tree's 95.82% (up from its 92.8% baseline)
- Manually calculated Precision (99.49%) and Recall (99.74%) to confirm balanced performance
Critical thinking: I didn't stop at accuracy. I analyzed the confusion matrix, the ROC curve (AUC = 0.999), and class-specific performance to confirm the model wasn't just memorizing the majority class (sketched below).
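A sketch of the tuning and evaluation using e1071's built-in 5-fold cross-validation; the parameter grids are illustrative, and caret and pROC supply the metrics beyond plain accuracy:

```r
library(e1071)
library(caret)
library(pROC)

# Grid-search an RBF-kernel SVM with 5-fold cross-validation
set.seed(42)
tuned <- tune.svm(Revenue ~ ., data = train_bal, kernel = "radial",
                  cost = c(0.1, 1, 10), gamma = c(0.01, 0.1),
                  tunecontrol = tune.control(cross = 5))

# Refit the winner with probability estimates enabled for the ROC curve
best_svm <- svm(Revenue ~ ., data = train_bal, kernel = "radial",
                cost = tuned$best.parameters$cost,
                gamma = tuned$best.parameters$gamma,
                probability = TRUE)

# Look past accuracy: per-class precision and recall from the confusion matrix
pred <- predict(best_svm, test)
confusionMatrix(pred, test$Revenue, positive = "Yes", mode = "prec_recall")

# ROC / AUC from the predicted class probabilities
probs <- attr(predict(best_svm, test, probability = TRUE), "probabilities")[, "Yes"]
auc(roc(test$Revenue, probs))
```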
- Validated Against Overfitting (Feature Leakage Check)
Challenge: Near-perfect accuracy raised red flags about data leakage.
Decision:
- Removed MonthMay and SpecialDay_Smoothed (high correlation: -0.76, -0.62) to test model dependency
- Retrained on the reduced feature set; accuracy dropped only 0.03 percentage points (from 99.95% to 99.92%)
- Conducted train-test split (80-20) and cross-validation to confirm generalization
Why it mattered: this showed the model wasn't leaning on a few obvious features; it genuinely learned behavioral patterns from BounceRates, ExitRates, and PageValues (sketch below).
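The leakage check in sketch form: drop the two suspect columns, retrain with the same settings, and compare held-out accuracy (column names follow the engineered dataset above):

```r
library(dplyr)
library(e1071)

holdout_acc <- function(fit, newdata)
  mean(predict(fit, newdata) == newdata$Revenue)

# Retrain without the features most suspected of leaking the label
train_red <- select(train_bal, -MonthMay, -SpecialDay_Smoothed)
svm_red <- svm(Revenue ~ ., data = train_red, kernel = "radial",
               cost = tuned$best.parameters$cost,
               gamma = tuned$best.parameters$gamma)

holdout_acc(best_svm, test)  # full feature set
holdout_acc(svm_red, test)   # reduced set: accuracy should barely move
```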
What Happened → Clear Business Outcome
Model Performance:
- SVM Accuracy: 98.99% (optimized)
- Precision: 99.49% (minimal false positives)
- Recall: 99.74% (catches almost all potential buyers)
- AUC: 0.9990 (near-perfect class separation)
Business Insights:
- Identified High-Value Segment (Cluster 3):
  - 30.5% conversion rate vs. the site average of ~16%
  - Characterized by low bounce rates, high PageValues, and longer product browsing time
  - Actionable: target this segment with premium offers and retargeting ads
- Low-Engagement Warning (Cluster 1):
  - 0.65% conversion rate (1,081 users)
  - High bounce rates, minimal time on product pages
  - Actionable: exclude from expensive ad campaigns; A/B test landing page improvements
- Seasonal Revenue Spikes:
  - November and March show the highest purchase intent
  - Actionable: increase ad spend 4-6 weeks before these months; prepare inventory
- Key Predictors Ranked (ranking sketch below):
  - PageValues (0.49 correlation) → optimize high-value pages
  - BounceRates/ExitRates (negative correlation) → improve page load speed and UX
  - ProductRelated_Duration → longer browsing signals higher intent
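One way to produce that ranking: a simple correlation screen of every numeric feature against the binary purchase label (a point-biserial correlation), sketched here:

```r
library(dplyr)

# Rank numeric features by absolute correlation with the purchase outcome
num_feats <- select(osi, where(is.numeric))
ranking   <- cor(num_feats, as.numeric(osi$Revenue == "Yes"))
sort(abs(ranking[, 1]), decreasing = TRUE)
```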
Estimated Business Impact:
If applied to a 100K monthly visitor site:
- Targeting Cluster 3 (30.5% conversion) could yield up to ~30,500 conversions/month, assuming the whole 100K audience matched that segment's profile; in practice the lift scales with the segment's share of traffic
- Excluding Cluster 1 from paid ads saves ~9% of ad budget
- Optimizing PageValues pages could lift overall conversion by 5-10% (based on correlation strength)
What I Learned → Show Growth Mindset
Technical Growth:
- Accuracy isn't everything: Learned to evaluate models using Precision, Recall, F1-Score, AUC, and confusion matrices—not just accuracy. This prevented me from deploying an overfitted model.
- Feature engineering > more data: Instead of seeking more records, I created meaningful features (ProductEngagement bins, rolling averages) that boosted model performance more than raw volume.
- Unsupervised learning unlocks supervised insights: Clustering revealed customer archetypes I wouldn't have found through classification alone—now I always explore data structure before jumping to prediction.
Strategic Thinking:
- Always validate against business logic: When SVM achieved 99.95% accuracy, I didn't celebrate—I tested for data leakage by removing high-correlation features. This discipline prevents real-world model failures.
- Interpretability matters for stakeholders: Decision Trees are easier to explain, but SVM performed better. I learned to balance model complexity with business communication—providing feature importance rankings to translate "black box" models into actionable insights.
Next Steps:
- I want to improve this project by:
  - Implementing real-time prediction using a Flask API
  - Testing ensemble methods (XGBoost, Gradient Boosting) to see if I can push past 99% accuracy
  - Building an interactive dashboard (Shiny or Tableau) for stakeholders to explore customer segments dynamically
Bottom line: This project taught me that data science isn't about running algorithms—it's about asking the right questions, validating assumptions, and translating findings into business value. I now approach every project by asking: "So what? How does this change our decision?"
How to Run
```r
# In RStudio or R console
source("OSI_analysis.R")
```
License: MIT