Business Context
ShopSphere, an online retail marketplace with 2M monthly users, wants to improve customer targeting. The marketing team needs both a supervised model to predict whether a customer will make a purchase in the next 30 days and an unsupervised approach to segment customers for campaign design.
Dataset
You are given a customer-level dataset built from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Demographics | 5 | age_band, region, device_type, acquisition_channel |
| Behavioral | 12 | sessions_30d, avg_session_duration, pages_per_session, cart_additions_30d |
| Transactional | 8 | orders_90d, avg_order_value, discount_usage_rate, returns_90d |
| Engagement | 6 | email_opens_30d, push_click_rate, days_since_last_visit, wishlist_items |
| Target | 1 | purchased_next_30d |
- Size: 120K customers, 31 input features
- Target: purchased_next_30d, a binary label for supervised learning indicating whether the customer purchases in the next 30 days
- Class balance: 28% positive, 72% negative
- Missing data: 10% missing in demographics, 6% missing in engagement fields, especially for newly acquired users
Success Criteria
A strong solution should:
- Achieve ROC-AUC >= 0.82 for purchase prediction
- Produce 3-6 actionable customer segments with clear behavioral differences
- Clearly explain when to use supervised versus unsupervised learning
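To make the ROC-AUC criterion concrete, here is a minimal sketch of how the metric can be computed from labels and scores using the rank-sum (Mann-Whitney U) formulation. The function name `roc_auc`, the helper `meets_target`, and the toy inputs are illustrative assumptions, not part of the brief; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
TARGET_AUC = 0.82  # threshold taken from the success criteria


def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum formulation, using average ranks for tied scores."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        positives_in_block = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_block
        i = j

    # U statistic for the positive class, normalised by the number of pos/neg pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def meets_target(y_true, y_score, target=TARGET_AUC):
    """Hypothetical helper: True when the model reaches the required ROC-AUC."""
    return roc_auc(y_true, y_score) >= target


# Toy example (made-up labels and scores, for illustration only).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Because ROC-AUC depends only on how positives rank relative to negatives, it is not distorted by the 28/72 class split the way raw accuracy would be.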
Constraints
- Predictions will run in a daily batch job on all active customers
- Marketing stakeholders need interpretable outputs, not just raw scores
- The segmentation approach should be stable enough to refresh monthly
Deliverables
- Build a supervised learning pipeline to predict purchased_next_30d
- Build an unsupervised learning pipeline to segment customers
- Compare the goals, inputs, outputs, and evaluation of both approaches
- Explain preprocessing choices for mixed numerical and categorical data
- Recommend how both models would be used together in production
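As one possible starting point for the deliverables above, the sketch below wires a shared preprocessing step for mixed numerical and categorical data into both a supervised and an unsupervised pipeline with scikit-learn. Everything here is an assumption for illustration: the data is synthetic, only a handful of the 31 features are included, and logistic regression and k-means with four clusters are placeholder choices rather than recommended final models.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the customer table (NOT the real ShopSphere data).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "device_type": rng.choice(["mobile", "desktop"], size=n),
    "sessions_30d": rng.poisson(5, size=n).astype(float),
    "orders_90d": rng.poisson(1, size=n).astype(float),
    "days_since_last_visit": rng.integers(0, 60, size=n).astype(float),
})
# Made-up relationship between behaviour and purchase, just to have a signal.
logit = (0.3 * df["sessions_30d"] + 0.8 * df["orders_90d"]
         - 0.05 * df["days_since_last_visit"] - 1.5)
df["purchased_next_30d"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
# Inject missing values at roughly the rates described in the dataset section.
df.loc[rng.random(n) < 0.10, "region"] = np.nan
df.loc[rng.random(n) < 0.06, "days_since_last_visit"] = np.nan

numeric_cols = ["sessions_30d", "orders_90d", "days_since_last_visit"]
categorical_cols = ["region", "device_type"]


def make_preprocessor():
    """Impute + scale numeric columns; impute + one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        # Treat missing categories as their own level, since missingness
        # is concentrated among newly acquired users.
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


X = df[numeric_cols + categorical_cols]
y = df["purchased_next_30d"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Supervised: uses the label, evaluated with ROC-AUC on held-out customers.
clf = Pipeline([
    ("prep", make_preprocessor()),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Unsupervised: ignores the label, assigns each customer to a segment.
seg = Pipeline([
    ("prep", make_preprocessor()),
    ("model", KMeans(n_clusters=4, n_init=10, random_state=0)),
])
segments = seg.fit_predict(X)
```

Fixing `random_state` and `n_init` is one simple way to make the monthly segmentation refresh more repeatable. Logistic regression coefficients and per-segment feature averages are both easy to present to marketing stakeholders, which is why these simple models are used as the placeholders here.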