Business Context
Northstar Retail, an e-commerce marketplace with 1.2M monthly active users, wants to improve lifecycle marketing. The growth team needs both customer segments for campaign design and a model that predicts whether a customer will purchase in the next 30 days.
Dataset
You are given a customer-level dataset built from the last 12 months of activity. The goal is to demonstrate when to use unsupervised learning versus supervised learning on the same business problem.
| Feature Group | Count | Examples |
|---|---|---|
| Behavioral metrics | 12 | sessions_30d, avg_session_duration, pages_per_visit, add_to_cart_rate |
| Transaction history | 8 | orders_90d, avg_order_value, discount_usage_rate, return_rate |
| Customer profile | 6 | acquisition_channel, country, device_type, loyalty_tier |
| Engagement recency | 5 | days_since_last_visit, days_since_last_order, email_open_rate |
| Target label | 1 | purchased_next_30d |
- Size: 240K customers, 31 input features
- Target: Binary label indicating whether the customer made at least one purchase in the following 30 days
- Class balance: 18% positive, 82% negative
- Missing data: 9% missing in email engagement fields, 4% missing in profile fields for guest users
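
The missing-data note above matters for both models, because a gap in an email engagement field may itself carry signal (for example, a customer who never received email). One common way to handle this, shown here as a minimal sketch and not as the brief's prescribed method, is to fill a numeric field with its median and keep a separate flag recording which values were missing. The function name `impute_with_flag` and the sample values are hypothetical.

```python
from statistics import median

def impute_with_flag(values):
    """Fill None entries with the median of observed values.

    Returns (imputed_values, missing_flags), where missing_flags[i] is 1
    if values[i] was originally missing and 0 otherwise.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    flags = [int(v is None) for v in values]
    return imputed, flags

# Hypothetical email_open_rate values; None marks a missing entry.
email_open_rate = [0.42, None, 0.10, 0.55, None, 0.31]
imputed, was_missing = impute_with_flag(email_open_rate)
print(imputed)
print(was_missing)
```

In practice the fill value would be computed on the training split only and reused at scoring time, so that holdout rows do not influence it. Categorical profile fields such as country or device_type would need a different treatment, such as an explicit "unknown" category.
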
Success Criteria
A strong solution should:
- Produce actionable customer segments with clear behavioral differences
- Train a supervised model with ROC-AUC >= 0.82 and F1 >= 0.55 on the holdout set
- Explain clearly why clustering does not use labels and classification does
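
To make the two numeric targets above concrete, the following sketch computes ROC-AUC and F1 from scratch on a toy holdout set and checks them against the stated thresholds. The helper names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration. ROC-AUC depends only on how the scores rank customers, while F1 depends on the chosen threshold, so with an 18% positive rate the threshold may need tuning.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is quadratic in the number of rows,
    so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(y_true, y_score, threshold=0.5):
    """F1 of the positive class after thresholding the scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def meets_criteria(y_true, y_score, threshold=0.5):
    """Check the brief's targets: ROC-AUC >= 0.82 and F1 >= 0.55."""
    return (roc_auc(y_true, y_score) >= 0.82
            and f1_at_threshold(y_true, y_score, threshold) >= 0.55)

# Toy holdout labels and model scores (hypothetical values).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.1, 0.4, 0.3, 0.05, 0.15, 0.25]
print(round(roc_auc(y_true, y_score), 3))
print(round(f1_at_threshold(y_true, y_score), 3))
print(meets_criteria(y_true, y_score))
```

On the full 240K-row dataset, library implementations such as scikit-learn's `roc_auc_score` and `f1_score` would be the practical choice.
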
Constraints
- Marketing needs segment definitions that are easy to explain
- Batch scoring must finish in under 10 minutes for 240K customers
- The solution should be maintainable by a small data team
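
The batch-scoring constraint above translates into a minimum throughput: 240,000 customers in 600 seconds is 400 customers per second. The sketch below shows one way to measure a scoring function against that budget. `timed_batch_score` and `dummy_score_fn` are hypothetical names, and the dummy function merely stands in for a real model's prediction call.

```python
import time

N_CUSTOMERS = 240_000      # from the brief
BUDGET_SECONDS = 10 * 60   # from the brief: under 10 minutes

def required_throughput(n_rows, budget_seconds):
    """Minimum rows per second needed to finish within the budget."""
    return n_rows / budget_seconds

def timed_batch_score(rows, score_fn, batch_size=10_000):
    """Score rows in fixed-size batches; return (scores, elapsed_seconds)."""
    start = time.perf_counter()
    scores = []
    for offset in range(0, len(rows), batch_size):
        scores.extend(score_fn(rows[offset:offset + batch_size]))
    return scores, time.perf_counter() - start

def dummy_score_fn(batch):
    # Hypothetical stand-in for a model's prediction call; constant score.
    return [0.18 for _ in batch]

rows = list(range(N_CUSTOMERS))
scores, elapsed = timed_batch_score(rows, dummy_score_fn)
print(f"need >= {required_throughput(N_CUSTOMERS, BUDGET_SECONDS):.0f} rows/s")
print(f"scored {len(scores)} rows, within budget: {elapsed < BUDGET_SECONDS}")
```

A real measurement should include feature loading and preprocessing, since those steps often dominate the model's own prediction time.
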
Deliverables
- Build an unsupervised segmentation pipeline and describe the resulting clusters
- Build a supervised model to predict purchased_next_30d
- Compare the two approaches and explain when each should be used
- Evaluate both outputs with appropriate metrics
- Recommend how both models would be used together in production
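
As one possible starting point for these deliverables, the sketch below fits a segmentation model and a purchase classifier on the same features and then combines their outputs. It is a sketch under stated assumptions, not a reference solution: the data is synthetic, all features are numeric with no missing values, and the choices of scikit-learn, k-means with four clusters, and logistic regression are illustrative. The key contrast is in the two fit calls: the segmenter sees only the features, while the classifier also receives the label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table (assumption: 5 numeric features).
rng = np.random.default_rng(0)
n_customers = 2_000
X = rng.normal(size=(n_customers, 5))
purchase_prob = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + X[:, 1] - 2.0)))
y = (rng.random(n_customers) < purchase_prob).astype(int)

# Stratify so the holdout keeps the same positive rate as the training data.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Unsupervised: fit on features only; the label y is never passed.
segmenter = make_pipeline(
    StandardScaler(), KMeans(n_clusters=4, n_init=10, random_state=0)
)
train_segments = segmenter.fit_predict(X_train)
silhouette = silhouette_score(segmenter[:-1].transform(X_train), train_segments)

# Supervised: the classifier needs y to learn what separates purchasers.
classifier = make_pipeline(
    StandardScaler(), LogisticRegression(class_weight="balanced", max_iter=1000)
)
classifier.fit(X_train, y_train)
hold_proba = classifier.predict_proba(X_hold)[:, 1]
auc = roc_auc_score(y_hold, hold_proba)

# Combined use: average predicted purchase probability within each segment.
hold_segments = segmenter.predict(X_hold)
for segment in np.unique(hold_segments):
    in_segment = hold_segments == segment
    print(f"segment {segment}: n={in_segment.sum()}, "
          f"mean p(purchase)={hold_proba[in_segment].mean():.2f}")
print(f"silhouette={silhouette:.2f}, holdout ROC-AUC={auc:.2f}")
```

The segmentation is judged with an internal measure (silhouette here) because there is no ground truth for segments, whereas the classifier is judged against held-out labels. In a production setting, the per-segment summary could let marketing choose a campaign by segment and prioritize recipients by predicted purchase probability.
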