Business Context
ShopSphere, an e-commerce marketplace with 2.4M monthly users, wants two capabilities from the same customer dataset: (1) predict whether a customer will make a purchase in the next 30 days, and (2) discover natural customer segments for lifecycle marketing. This question tests whether you understand when to use supervised vs. unsupervised learning and how to implement both correctly.
Dataset
You are given a customer-level feature table built from the last 90 days of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Behavioral | 12 | sessions_30d, avg_session_minutes, product_views_30d, cart_adds_30d |
| Transactional | 8 | orders_90d, avg_order_value, discount_usage_rate, returns_90d |
| Marketing | 5 | email_open_rate, push_click_rate, acquisition_channel |
| Profile | 6 | country, device_type, tenure_days, loyalty_tier |
| Derived | 7 | recency_days, views_per_session, cart_to_view_rate |
- Size: 120K customers, 38 features
- Target for supervised task: purchased_next_30d (1 if customer purchases in the next 30 days, else 0)
- Class balance: 22% positive, 78% negative
- Missing data: ~9% missing in marketing engagement fields, ~4% missing in avg_order_value for customers with no prior purchases
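The two kinds of missingness described above probably call for different treatment. The following is a minimal sketch, not a prescribed solution: the helper `impute_with_flag` and the four toy rows are hypothetical, and it assumes missing values arrive as `None`. Marketing fields are filled with the median of observed values, while avg_order_value is filled with 0.0 because a missing value there means the customer has no prior purchase. In both cases an indicator column is kept so a model can still see that the value was absent.

```python
from statistics import median


def impute_with_flag(rows, column, fill=None):
    """Fill None values in `column` and add a `<column>_missing` 0/1 flag.

    If `fill` is None, the median of the observed values is used.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = median(observed) if fill is None else fill
    out = []
    for r in rows:
        missing = r[column] is None
        new_row = dict(r)  # copy so the input rows are not mutated
        new_row[column] = fill_value if missing else r[column]
        new_row[f"{column}_missing"] = int(missing)
        out.append(new_row)
    return out


# Hypothetical toy rows, for illustration only.
customers = [
    {"email_open_rate": 0.40, "avg_order_value": 52.0},
    {"email_open_rate": None, "avg_order_value": 18.5},
    {"email_open_rate": 0.10, "avg_order_value": None},  # no prior purchases
    {"email_open_rate": 0.25, "avg_order_value": 75.0},
]

# Marketing field: median imputation plus a missing indicator.
step1 = impute_with_flag(customers, "email_open_rate")
# avg_order_value: missing means "no prior order", so fill with 0.0 and keep the flag.
prepared = impute_with_flag(step1, "avg_order_value", fill=0.0)
```

In a real pipeline the median would be computed on the training split only and reused on validation and test data, to avoid leaking information across splits.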
Success Criteria
A good solution should:
- Build a supervised model that achieves ROC-AUC >= 0.82 and F1 >= 0.60 on a held-out test set
- Produce customer segments with silhouette score >= 0.25 and clear business interpretation
- Clearly explain why prediction uses supervised learning and segmentation uses unsupervised learning
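To make the three thresholds concrete, here is a small reference sketch of how each metric is defined, written in plain Python on toy data. The function names and example values are illustrative assumptions. The pairwise loops are quadratic, so this is meant for understanding rather than for 120K customers, where a library implementation would be the practical choice.

```python
import math


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over points, using Euclidean distance."""
    clusters = set(labels)
    total = 0.0
    for i, (p, li) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
                if lj == li and j != i]
        if not same:
            continue  # singleton cluster contributes 0 by convention
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lj in zip(points, labels) if lj == c)
            / labels.count(c)
            for c in clusters if c != li
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Supervised metrics need labels; scores are thresholded at 0.5 for F1.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)

# The silhouette score needs only features and cluster assignments, no labels.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
sil = silhouette(points, [0, 0, 1, 1])
```

As a reference point for the F1 target: with 22% positives, predicting every customer as a buyer gives precision 0.22 and recall 1.0, so F1 = 2 × 0.22 / 1.22 ≈ 0.36. The 0.60 threshold therefore requires a model with real discriminative power, and the decision threshold may need tuning rather than defaulting to 0.5.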
Constraints
- Batch scoring must finish in under 10 minutes for 120K customers
- Marketing stakeholders need interpretable outputs: feature importance for prediction and understandable cluster profiles
- Retraining can happen weekly; no real-time serving is required
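The batch constraint translates into a throughput floor: 120,000 customers in 600 seconds is an average of 200 customers per second. One way to enforce the budget is sketched below, assuming a hypothetical `score_fn` that scores a list of rows at once. The chunk size and the toy scoring function are placeholders, not part of the problem statement.

```python
import time


def batch_score(rows, score_fn, chunk_size=10_000, budget_seconds=600.0):
    """Score rows chunk by chunk and fail loudly if the time budget is exceeded."""
    start = time.perf_counter()
    scores = []
    for i in range(0, len(rows), chunk_size):
        scores.extend(score_fn(rows[i:i + chunk_size]))
        if time.perf_counter() - start > budget_seconds:
            raise TimeoutError("batch scoring exceeded the time budget")
    return scores


# Required average throughput: 120,000 customers in 600 seconds.
required_rate = 120_000 / 600  # 200 customers per second


# Toy stand-in for a trained model (hypothetical): a capped function of sessions_30d.
def toy_score_fn(chunk):
    return [min(1.0, row["sessions_30d"] / 20) for row in chunk]


rows = [{"sessions_30d": n % 25} for n in range(25_000)]
scores = batch_score(rows, toy_score_fn)
```

Because retraining is weekly and no real-time serving is required, a scheduled job along these lines is likely sufficient.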
Deliverables
- Build a supervised model to predict purchased_next_30d
- Build an unsupervised model to segment customers
- Compare the goals, inputs, outputs, and evaluation of both approaches
- Describe preprocessing choices for mixed feature types and missing data
- Recommend how both models would be used together in production
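For the last deliverable, one plausible way to combine the two outputs is to join each customer's predicted purchase probability (from the supervised model) with their segment ID (from the clustering model) and profile each segment by its average propensity. The sketch below assumes both outputs are already available as aligned lists. The function name, segment IDs, and probabilities are hypothetical.

```python
from collections import defaultdict


def summarize_segments(segment_ids, purchase_probs):
    """Return customer count and mean predicted purchase probability per segment."""
    totals = defaultdict(lambda: [0, 0.0])  # segment -> [count, sum of probabilities]
    for seg, prob in zip(segment_ids, purchase_probs):
        totals[seg][0] += 1
        totals[seg][1] += prob
    return {
        seg: {"customers": n, "mean_purchase_prob": s / n}
        for seg, (n, s) in totals.items()
    }


# Hypothetical outputs of the two models for six customers.
segment_ids = [0, 0, 1, 1, 1, 2]                         # unsupervised: cluster assignment
purchase_probs = [0.80, 0.60, 0.10, 0.20, 0.30, 0.50]    # supervised: P(purchase in 30 days)

summary = summarize_segments(segment_ids, purchase_probs)
```

A table like `summary` gives marketing stakeholders a segment-level view (who the customers are) alongside a propensity view (how likely they are to buy), which is the complementary use the two approaches are meant to illustrate.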