Business Context
Northstar Retail, an e-commerce marketplace with 2.4M annual customers, wants to improve lifecycle marketing. The data science team needs both a supervised model to predict whether a customer will purchase in the next 30 days and an unsupervised model to segment customers for campaign targeting.
Dataset
You are given a customer-level dataset built from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Transaction history | 8 | total_orders, avg_order_value, days_since_last_purchase |
| Engagement | 6 | email_open_rate, app_sessions_30d, website_visits_30d |
| Customer profile | 5 | region, acquisition_channel, loyalty_tier |
| Product behavior | 5 | categories_bought, discount_usage_rate, return_rate |
| Target label | 1 | purchased_next_30d |
- Size: 120K customers, 24 input features, 1 binary target
- Target: purchased_next_30d, where 1 indicates a purchase in the next 30 days
- Class balance: 28% positive, 72% negative
- Missing data: ~7% missing in engagement fields, ~3% missing in profile fields
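The missing-data figures above suggest that imputation belongs inside the modeling pipeline rather than in a one-off cleaning step. The following is a minimal sketch, assuming median imputation for numeric fields and most-frequent imputation for categorical fields; both strategies are illustrative choices, not something the brief prescribes. The column lists are a hypothetical subset of the 24 input features, and `make_toy_customers` generates synthetic data only so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical subset of the input features; names follow the table above.
NUMERIC_COLS = ["total_orders", "avg_order_value", "email_open_rate", "app_sessions_30d"]
CATEGORICAL_COLS = ["region", "loyalty_tier"]


def build_preprocessor():
    """Impute and scale numeric columns; impute and one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0.0 forces a dense array, which keeps the sketch simple.
    return ColumnTransformer(
        [("num", numeric, NUMERIC_COLS), ("cat", categorical, CATEGORICAL_COLS)],
        sparse_threshold=0.0,
    )


def make_toy_customers(n=1000, seed=0):
    """Synthetic stand-in for the real dataset, with similar missingness rates."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "total_orders": rng.poisson(5, n).astype(float),
        "avg_order_value": rng.gamma(2.0, 30.0, n),
        "email_open_rate": rng.uniform(0, 1, n),
        "app_sessions_30d": rng.poisson(8, n).astype(float),
        "region": pd.Series(rng.choice(["north", "south", "east", "west"], n), dtype=object),
        "loyalty_tier": pd.Series(rng.choice(["bronze", "silver", "gold"], n), dtype=object),
    })
    # Roughly 7% missing in engagement fields and 3% in profile fields.
    for col in ["email_open_rate", "app_sessions_30d"]:
        df.loc[rng.random(n) < 0.07, col] = np.nan
    for col in ["region", "loyalty_tier"]:
        df.loc[rng.random(n) < 0.03, col] = np.nan
    return df


customers = make_toy_customers()
features = build_preprocessor().fit_transform(customers)
print(features.shape)
```

The same preprocessor can feed both the classifier and the clustering model, which keeps the two pipelines consistent. Fitting it on the training split only avoids leaking test-set statistics into the imputation and scaling steps.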
Success Criteria
A strong solution should:
- Explain clearly when to use supervised learning vs unsupervised learning
- Build a purchase prediction model with ROC-AUC >= 0.82 and F1 >= 0.68 on the test set
- Produce customer segments that are stable, interpretable, and useful for marketing
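One way to make these criteria checkable is sketched below. The thresholds come from the list above; everything else is an assumption. In particular, `evaluate_purchase_model` and `segment_stability` are hypothetical helper names, the 0.5 decision threshold is only a default (with 28% positives, a tuned threshold may be needed to reach the F1 target), and measuring stability as the adjusted Rand index between K-means runs with different seeds is one of several possible definitions of "stable".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, f1_score, roc_auc_score

ROC_AUC_TARGET = 0.82  # from the success criteria
F1_TARGET = 0.68       # from the success criteria


def evaluate_purchase_model(y_true, y_score, threshold=0.5):
    """Compare test-set ROC-AUC and F1 against the stated targets."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # F1 needs hard labels, so scores are cut at the chosen threshold.
    y_pred = (y_score >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_score)
    f1 = f1_score(y_true, y_pred)
    return {
        "roc_auc": auc,
        "f1": f1,
        "passes": bool(auc >= ROC_AUC_TARGET and f1 >= F1_TARGET),
    }


def segment_stability(X, n_clusters=4, seeds=(0, 1, 2)):
    """Mean adjusted Rand index between the first K-means run and the others.

    Values near 1 mean the segments barely change across random restarts.
    """
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        for seed in seeds
    ]
    scores = [adjusted_rand_score(labelings[0], other) for other in labelings[1:]]
    return float(np.mean(scores))
```

Interpretability and marketing usefulness are not captured by either function; those would still need a review of per-segment feature averages with stakeholders.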
Constraints
- Marketing needs segment definitions simple enough to explain to non-technical stakeholders
- Batch scoring must finish in under 10 minutes for 120K customers
- The solution should be maintainable by a small ML team using standard Python tooling
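The batch-scoring constraint can be checked empirically before committing to a model family. The sketch below times chunked scoring of 120K rows with 24 numeric features. The data is random noise, the logistic regression is only a stand-in for whichever classifier is eventually chosen, and `score_in_chunks` is a hypothetical helper, so the measured time says nothing about the real pipeline beyond showing how to take the measurement.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

N_CUSTOMERS = 120_000
N_FEATURES = 24
BUDGET_SECONDS = 10 * 60  # the 10-minute limit from the constraints


def score_in_chunks(model, X, chunk_size=20_000):
    """Score X in fixed-size chunks and return (scores, elapsed_seconds)."""
    scores = np.empty(X.shape[0], dtype=float)
    start = time.perf_counter()
    for lo in range(0, X.shape[0], chunk_size):
        hi = lo + chunk_size
        # Column 1 of predict_proba is the probability of the positive class.
        scores[lo:hi] = model.predict_proba(X[lo:hi])[:, 1]
    elapsed = time.perf_counter() - start
    return scores, elapsed


# Synthetic stand-in data; the labels are arbitrary and exist only to fit a model.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(N_CUSTOMERS, N_FEATURES))
y_fit = (X_all[:5000, 0] + rng.normal(size=5000) > 0.6).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_all[:5000], y_fit)

scores, elapsed = score_in_chunks(model, X_all)
print(f"scored {len(scores):,} rows in {elapsed:.2f}s (budget {BUDGET_SECONDS}s)")
```

Chunking keeps peak memory bounded if preprocessing expands the feature matrix, for example through one-hot encoding. A realistic timing should include data loading and preprocessing as well as the prediction step.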
Deliverables
- Explain the difference between supervised and unsupervised learning using this dataset
- Train a supervised model for purchased_next_30d
- Train an unsupervised clustering model for customer segmentation
- Compare how the data preparation, training objective, and evaluation differ between the two approaches
- Recommend how both models would be used together in production
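The contrast between the two approaches, and one way they might be combined, can be sketched on synthetic data. The classifier is fit on features and the label, while the clustering model is fit on features alone; each scored customer then carries both a purchase propensity and a segment. Everything here is an illustrative assumption: the three features are stand-ins named after columns in the dataset table, the label is simulated, and logistic regression and K-means with three clusters are placeholder choices rather than recommended ones.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for three numeric features; not the real dataset.
X = np.column_stack([
    rng.poisson(5, n),        # total_orders
    rng.exponential(30, n),   # days_since_last_purchase
    rng.uniform(0, 1, n),     # email_open_rate
]).astype(float)

# Simulated purchased_next_30d label, roughly a quarter to a third positive.
logit = 0.3 * X[:, 0] - 0.05 * X[:, 1] + 1.5 * X[:, 2] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)

# Supervised: learns a mapping from features to the known label.
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Unsupervised: groups customers by feature similarity and never sees y.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X_train))

# Combined view: every scored customer gets a segment and a propensity.
X_test_scaled = scaler.transform(X_test)
scored = pd.DataFrame({
    "segment": km.predict(X_test_scaled),
    "propensity": clf.predict_proba(X_test_scaled)[:, 1],
})
summary = scored.groupby("segment")["propensity"].agg(
    customers="size", mean_propensity="mean"
)
print(summary)
```

A table like `summary` gives marketing one row per segment, with its size and average likelihood to purchase, which is one plausible basis for deciding which campaigns go to which segment.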