Business Context
Northstar Retail, an e-commerce marketplace with 2.4M registered users, wants to improve customer targeting. The growth team needs both a supervised model to predict which users will purchase in the next 30 days and an unsupervised model to discover customer segments for marketing campaigns.
Dataset
You are given a user-level dataset built from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Behavioral | 12 | sessions_30d, avg_session_duration, add_to_cart_rate, product_views_30d |
| Transactional | 8 | orders_90d, avg_order_value, refund_rate, days_since_last_purchase |
| Customer profile | 6 | acquisition_channel, country, device_type, loyalty_tier |
| Engagement | 5 | email_open_rate, push_click_rate, wishlist_count, support_tickets_90d |
| Target | 1 | purchased_next_30d |
- Size: 320K users, 31 input features
- Target: Binary label indicating whether a user makes at least one purchase in the next 30 days
- Class balance: 18% positive, 82% negative
- Missing data: 9% missing in engagement features, 4% missing in profile fields, and sparse purchase history for new users
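One plausible way to handle the missing engagement values noted above is median imputation combined with a missingness flag, so that "no data" stays visible to the model. The sketch below is a minimal illustration under that assumption, not a prescribed approach; the helper name `impute_with_flag` and the toy rows are hypothetical.

```python
from statistics import median

def impute_with_flag(rows, column):
    """Fill None values in `column` with the observed median and add a 0/1 flag.

    Assumption: in a real pipeline the median would be computed on the
    training split only and reused for validation/test to avoid leakage.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = median(observed)
    imputed_rows = []
    for r in rows:
        is_missing = r[column] is None
        new_row = dict(r)  # copy so the input rows are not mutated
        new_row[column] = fill_value if is_missing else r[column]
        new_row[f"{column}_missing"] = int(is_missing)
        imputed_rows.append(new_row)
    return imputed_rows

# Toy rows (hypothetical values) using a feature name from the table above.
users = [
    {"user_id": 1, "email_open_rate": 0.40},
    {"user_id": 2, "email_open_rate": None},
    {"user_id": 3, "email_open_rate": 0.10},
    {"user_id": 4, "email_open_rate": 0.20},
]
imputed = impute_with_flag(users, "email_open_rate")
```

The flag column is useful here because new users with sparse history may be missing values for a systematic reason, which a plain fill would hide.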
Success Criteria
A strong solution should:
- Build a supervised learning pipeline that achieves ROC-AUC >= 0.82 and F1 >= 0.55 on the held-out test set
- Build an unsupervised segmentation approach that produces clusters with a silhouette score >= 0.20 and a clear business interpretation
- Clearly explain the difference between supervised and unsupervised learning using this dataset and justify when each should be used
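To make the three metric targets concrete, the following sketch computes ROC-AUC, F1, and the mean silhouette score from first principles on toy data. It is illustrative only: the function names, the 0.5 decision threshold, and the toy labels, scores, and points are assumptions, and in practice a library such as scikit-learn would likely be used instead of these quadratic-time loops.

```python
from math import dist

def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(y_true, y_score, threshold=0.5):
    """F1 of the positive class after thresholding scores (threshold is an assumption)."""
    pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance; needs >= 2 clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy supervised example: labels need the target; scores come from a model.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8, 0.3, 0.4]
auc = roc_auc(y_true, y_score)
f1 = f1_at_threshold(y_true, y_score)
supervised_ok = auc >= 0.82 and f1 >= 0.55

# Toy unsupervised example: no labels, only features and cluster assignments.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
cluster_ids = [0, 0, 1, 1]
unsupervised_ok = silhouette(points, cluster_ids) >= 0.20
```

The contrast also illustrates the third criterion: the supervised metrics compare predictions against purchased_next_30d, while the silhouette score uses only the geometry of the feature space.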
Constraints
- Predictions will run in a daily batch job on all active users
- Marketing stakeholders need interpretable outputs, not just raw scores
- Training should complete within a standard notebook or single VM environment
Deliverables
- Train a supervised model to predict purchased_next_30d
- Train an unsupervised model to segment customers using the same feature space (excluding the target)
- Compare the goals, inputs, outputs, and evaluation of supervised vs. unsupervised learning
- Describe feature engineering, preprocessing, and validation choices
- Recommend how both models would be used together in production
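As one possible illustration of the last deliverable, the sketch below joins the two model outputs in a daily batch table and summarizes the purchase score per segment, which is one way to give marketing an interpretable view rather than raw scores. The field names `segment` and `purchase_score`, the segment labels, and the values are all hypothetical.

```python
from collections import defaultdict

def summarize_segments(scored_users):
    """Return per-segment user count and mean purchase score."""
    totals = defaultdict(lambda: [0, 0.0])  # segment -> [count, score sum]
    for user in scored_users:
        totals[user["segment"]][0] += 1
        totals[user["segment"]][1] += user["purchase_score"]
    return {
        segment: {"users": count, "mean_score": round(score_sum / count, 3)}
        for segment, (count, score_sum) in totals.items()
    }

# Hypothetical daily batch output: one row per active user, carrying the
# cluster label from the segmentation model and the classifier's score.
daily_batch = [
    {"user_id": 1, "segment": "loyal_high_value", "purchase_score": 0.80},
    {"user_id": 2, "segment": "loyal_high_value", "purchase_score": 0.60},
    {"user_id": 3, "segment": "dormant_browsers", "purchase_score": 0.10},
    {"user_id": 4, "segment": "dormant_browsers", "purchase_score": 0.30},
]
summary = summarize_segments(daily_batch)
```

A table like `summary` lets a campaign owner decide which segment to target and how aggressively, with the segment supplying the "who" and the score supplying the "how likely".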