Business Context
ShopSphere, an online retail marketplace with 2M monthly users, wants to improve customer targeting. The growth team needs both a supervised model to predict whether a customer will purchase in the next 30 days and an unsupervised model to segment customers for marketing campaigns.
Dataset
You are given a customer-level dataset built from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Demographics | 5 | age, region, device_type, acquisition_channel |
| Behavioral | 10 | sessions_last_30d, avg_session_length, pages_per_session, cart_add_rate |
| Transactional | 8 | orders_last_90d, avg_order_value, refund_rate, days_since_last_purchase |
| Engagement | 5 | email_open_rate, push_click_rate, wishlist_count, support_tickets |
| Target | 1 | purchased_next_30d |
- Size: 120K customers, 28 input features
- Target: Binary label for supervised learning — whether the customer makes a purchase in the next 30 days
- Class balance: 22% positive, 78% negative
- Missing data: 8% missing in engagement features, 3% missing in demographics
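The class-balance figure above converts directly into customer counts, which helps when deciding how to weight the minority class. The following is a minimal sketch of that arithmetic. The "balanced" weighting heuristic it uses (n_samples / (n_classes * n_samples_in_class)) is an assumption for illustration, not something this brief prescribes.

```python
# Figures taken from the dataset description above.
n_customers = 120_000
positive_rate = 0.22

n_pos = round(n_customers * positive_rate)  # customers who purchased
n_neg = n_customers - n_pos                 # customers who did not

# "Balanced" class-weight heuristic (an assumption, not mandated by the brief):
# weight_c = n_samples / (n_classes * n_samples_in_class_c)
w_pos = n_customers / (2 * n_pos)
w_neg = n_customers / (2 * n_neg)

# Accuracy of always predicting "no purchase", for reference.
majority_baseline_accuracy = n_neg / n_customers

print(n_pos, n_neg)
print(round(w_pos, 3), round(w_neg, 3), majority_baseline_accuracy)
```

Because always predicting the majority class already scores 78% accuracy, a threshold-independent metric such as ROC-AUC is more informative than accuracy on this dataset.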
Success Criteria
A good solution should:
- Achieve ROC-AUC >= 0.82 on the supervised task
- Produce interpretable customer segments with a silhouette score >= 0.30 on the unsupervised task
- Clearly explain when supervised vs. unsupervised learning is appropriate
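To make the ROC-AUC criterion above concrete, here is a small sketch that computes it from its pairwise definition: the probability that a randomly chosen purchaser is scored higher than a randomly chosen non-purchaser, with ties counted as half. The function name `roc_auc`, the toy labels and scores, and the `TARGET_AUC` constant are all illustrative assumptions. In practice a library routine would be used instead of this quadratic loop.

```python
def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


TARGET_AUC = 0.82  # threshold from the success criteria above

# Toy labels and scores, purely illustrative.
y_true = [1, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.35, 0.20, 0.80]

auc = roc_auc(y_true, scores)
meets_target = auc >= TARGET_AUC
print(round(auc, 4), meets_target)
```

In this toy case, four of the six positive-negative pairs are ordered correctly, so the AUC is 4/6 and the target is not met.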
Constraints
- Predictions will run daily in batch on 120K customers
- Marketing stakeholders need interpretable outputs
- Training should complete within 30 minutes on a standard cloud VM
Deliverables
- Build a supervised learning pipeline to predict purchased_next_30d
- Build an unsupervised learning pipeline to segment customers without using the target
- Compare the two approaches and explain how they differ in objective, data requirements, and evaluation
- Report model performance with appropriate metrics
- Recommend how both outputs would be used in production
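One possible shape for the two pipelines is sketched below with scikit-learn. It is a starting point under stated assumptions, not a reference solution. The data is a synthetic stand-in generated by `make_classification`, not the ShopSphere table. The model choices (median imputation, class-weighted logistic regression, k-means with 4 clusters) are illustrative and would need tuning against the success criteria.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table (NOT the real data):
# 28 numeric features, roughly 22% positives.
X, y = make_classification(
    n_samples=3000, n_features=28, n_informative=8,
    weights=[0.78, 0.22], random_state=0,
)

# Blank out ~8% of the last five columns to mimic missing engagement features.
rng = np.random.default_rng(0)
mask = rng.random((X.shape[0], 5)) < 0.08
X[:, -5:] = np.where(mask, np.nan, X[:, -5:])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Supervised: impute, scale, then fit a class-weighted logistic regression.
clf = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Unsupervised: same preprocessing, but the target y is never passed in.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_seg = prep.fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_seg)
sil = silhouette_score(X_seg, segments)

print(f"ROC-AUC: {auc:.3f}  silhouette: {sil:.3f}")
```

The scores printed on synthetic data say nothing about performance on the real dataset. For the interpretability constraint, the logistic regression coefficients and the k-means cluster centroids (`kmeans.cluster_centers_`) are natural outputs to share with marketing stakeholders.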