Business Context
ShopSphere, an e-commerce marketplace with 2.4M monthly active users, wants to segment shoppers into behavior-based groups for lifecycle marketing and merchandising. There are no labeled segment definitions today, so the team needs an unsupervised clustering solution that is stable, interpretable, and usable in production.
Dataset
The modeling table is built at the customer-month level from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Purchase behavior | 8 | orders_30d, avg_order_value, discount_share, return_rate |
| Browsing behavior | 6 | sessions_30d, product_views, search_queries, dwell_time |
| Category affinity | 10 | pct_spend_electronics, pct_spend_home, pct_spend_fashion |
| Engagement & tenure | 5 | days_since_last_order, account_age_days, email_click_rate |
| Geography / device | 4 | region, device_type, app_vs_web_share |
- Size: 420K customer-month rows, 33 features
- Missing data: ~12% of email engagement values are missing (unsubscribed users), and ~4% of browsing metric values are missing due to tracking gaps
- Data characteristics: mixed numerical and categorical features, strong skew in spend/order variables, many low-activity users
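The data characteristics above (missing values, skewed spend/order variables, mixed types) suggest a preprocessing step along the following lines. This is a minimal sketch using scikit-learn, not a prescribed solution. The assignment of columns to groups is an assumption made for illustration, the choice of median imputation and `log1p` is one reasonable option among several, and the `toy` frame is synthetic rather than ShopSphere data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column grouping; only the column names come from the feature table.
skewed_cols = ["orders_30d", "avg_order_value", "sessions_30d"]
gappy_cols = ["email_click_rate"]
categorical_cols = ["region", "device_type"]

preprocess = ColumnTransformer(
    [
        # Median-impute, compress the long right tail with log1p, then standardize.
        ("skewed", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("log1p", FunctionTransformer(np.log1p)),
            ("scale", StandardScaler()),
        ]), skewed_cols),
        # Median-impute and keep a "was missing" flag, since missingness
        # (for example, being unsubscribed) may itself carry signal.
        ("gappy", Pipeline([
            ("impute", SimpleImputer(strategy="median", add_indicator=True)),
            ("scale", StandardScaler()),
        ]), gappy_cols),
        # One-hot encode categoricals; unseen categories at scoring time become all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Tiny synthetic frame, for illustration only.
toy = pd.DataFrame({
    "orders_30d": [0, 1, 0, 12],
    "avg_order_value": [0.0, 35.0, 0.0, 410.0],
    "sessions_30d": [1, 4, np.nan, 60],
    "email_click_rate": [0.10, np.nan, 0.02, np.nan],
    "region": ["EU", "NA", "NA", "APAC"],
    "device_type": ["app", "web", "web", "app"],
})

X = preprocess.fit_transform(toy)
print(X.shape)
```

Distance-based algorithms such as k-means are sensitive to feature scale, which is why every numeric branch ends in `StandardScaler`. Whether one-hot columns should be down-weighted relative to numeric features is a design choice left open here.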
Success Criteria
A good solution should produce 4-8 actionable clusters with:
- silhouette score >= 0.20 after preprocessing
- cluster stability (Adjusted Rand Index across resamples) >= 0.75
- clear business interpretation for marketing and merchandising teams
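The two quantitative criteria can be checked roughly as follows. This is a sketch under stated assumptions: k-means stands in for whichever algorithm is chosen, the data is a synthetic stand-in for the preprocessed feature matrix, "resamples" is read as bootstrap resamples, and stability is summarized as the mean ARI between a reference fit and each refit. The brief does not specify any of these choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the preprocessed feature matrix (four separated blobs).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=1.0,
    random_state=0,
)

k = 4
reference = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Silhouette on the reference fit. On a very large table, the sample_size
# argument scores a random subsample instead of every pairwise distance.
sil = silhouette_score(X, reference.labels_)

# Stability: refit on bootstrap resamples, label the full data with each refit,
# and compare against the reference labels. ARI is invariant to label permutation,
# so cluster ids do not need to be aligned between fits.
rng = np.random.default_rng(0)
ari_scores = []
for seed in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)
    refit = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    ari_scores.append(adjusted_rand_score(reference.labels_, refit.predict(X)))

mean_ari = float(np.mean(ari_scores))
print(f"silhouette={sil:.2f}  mean ARI={mean_ari:.2f}")
print("meets thresholds:", sil >= 0.20 and mean_ari >= 0.75)
```

On real customer data the clusters will be far less separated than in this toy example, so both scores should be expected to sit much closer to the thresholds.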
Constraints
- Segments must be explainable to non-technical stakeholders
- Batch scoring must finish in under 30 minutes weekly
- The team prefers simple maintenance over highly complex deep learning approaches
Deliverables
- Explain what clustering is and when it is appropriate instead of classification or regression.
- Build a clustering pipeline for customer segmentation, including preprocessing and feature engineering.
- Compare at least two clustering algorithms and justify the final choice.
- Evaluate cluster quality using quantitative metrics and qualitative interpretation.
- Describe how you would deploy, monitor, and refresh segments in production.
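For the algorithm-comparison deliverable, one possible starting point is a small grid over candidate algorithms and cluster counts. The sketch below compares k-means with a Gaussian mixture model over k = 4 to 8 using the silhouette score. The choice of these two algorithms is an assumption (the brief does not name any), and the data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the preprocessed feature matrix (five separated blobs).
X, _ = make_blobs(
    n_samples=500,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6], [3, 12]],
    cluster_std=0.8,
    random_state=1,
)

results = []  # (algorithm name, k, silhouette)
for k in range(4, 9):  # the success criteria ask for 4 to 8 clusters
    candidates = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
        "gmm": GaussianMixture(n_components=k, n_init=3, random_state=0),
    }
    for name, model in candidates.items():
        labels = model.fit_predict(X)
        results.append((name, k, silhouette_score(X, labels)))

# Show the three best-scoring configurations.
for name, k, score in sorted(results, key=lambda r: r[2], reverse=True)[:3]:
    print(f"{name:6s} k={k}  silhouette={score:.3f}")
```

A metric ranking like this is only one input to the final choice. Stability across resamples, scoring time, and how easily each segment can be explained to stakeholders matter as well.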