Business Context
ShopSphere, a mid-sized e-commerce marketplace with 1.2M registered users, wants to improve retention and marketing efficiency. The growth team has historical purchase labels for some campaigns, but also wants customer segments for personalization where no labels exist.
Dataset
You are given a customer-level dataset built from 12 months of activity. The task is intentionally split into two parts to test your understanding of supervised vs. unsupervised learning using the same business domain.
| Feature Group | Count | Examples |
|---|---|---|
| Demographics | 6 | age, region, device_type, acquisition_channel |
| Behavioral | 12 | sessions_last_30d, avg_session_duration, pages_per_session, cart_add_rate |
| Transactional | 10 | orders_last_90d, avg_order_value, discount_usage_rate, refund_rate |
| Engagement | 8 | email_open_rate, push_click_rate, days_since_last_visit, loyalty_points |
- Size: 240K customers, 36 features
- Target for supervised task: purchased in the next 30 days (1/0)
- Class balance: 18% positive, 82% negative
- Missing data: 9% missing in demographics, 4% missing in engagement for recently acquired users
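The class balance above has a practical consequence that is easy to check by hand: with 18% positives, trivial classifiers can look acceptable on accuracy while being useless for targeting. The short calculation below is only an illustrative sketch; the two "always predict one class" baselines are assumptions introduced here for comparison, not part of the task.

```python
# Baseline arithmetic for an 18% positive rate (figure from the dataset description).
positive_rate = 0.18

# Trivial baseline 1: predict "no purchase" for every customer.
# Accuracy looks high, but recall is 0, so F1 is 0.
accuracy_all_negative = 1 - positive_rate

# Trivial baseline 2: predict "purchase" for every customer.
precision_all_positive = positive_rate  # only 18% of positive predictions are correct
recall_all_positive = 1.0               # every true buyer is flagged
f1_all_positive = (
    2 * precision_all_positive * recall_all_positive
    / (precision_all_positive + recall_all_positive)
)

print(f"accuracy (all negative): {accuracy_all_negative:.2f}")  # 0.82
print(f"F1 (all positive):       {f1_all_positive:.3f}")        # 0.305
```

Any useful model therefore has to beat roughly 0.31 F1 on the positive class, and accuracy alone should not be used to judge it.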
Success Criteria
A strong solution should:
- Build a supervised model that predicts 30-day purchase intent with ROC-AUC ≥ 0.82 and F1 ≥ 0.55
- Build an unsupervised segmentation approach with interpretable clusters and silhouette score ≥ 0.20
- Clearly explain the difference between the two learning paradigms, including when labels are required and how evaluation differs
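To make the difference in evaluation concrete, the sketch below computes all three metrics named above with scikit-learn. It runs on synthetic data from `make_classification` as a stand-in for the real customer table, and the choice of logistic regression, k-means with 4 clusters, and a 0.5 decision threshold are assumptions made for illustration only. The point is which inputs each metric needs, not the numbers it prints.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 36 features, roughly 18% positives.
X, y = make_classification(
    n_samples=2000, n_features=36, n_informative=8,
    weights=[0.82, 0.18], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on training rows only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Supervised: both metrics compare predictions against held-out labels.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_s, y_train)
proba = clf.predict_proba(X_test_s)[:, 1]
auc = roc_auc_score(y_test, proba)                 # threshold-free ranking quality
f1 = f1_score(y_test, (proba >= 0.5).astype(int))  # depends on the chosen threshold

# Unsupervised: labels are never used; silhouette scores cohesion vs. separation.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_train_s)
sil = silhouette_score(X_train_s, cluster_ids)

print(f"ROC-AUC={auc:.3f}  F1={f1:.3f}  silhouette={sil:.3f}")
```

Note that `roc_auc_score` and `f1_score` both take `y_test`, while `silhouette_score` takes only the features and the cluster assignments. A good silhouette score says the clusters are geometrically well separated; it does not say they are useful for marketing, which is why interpretability is listed as a separate criterion.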
Constraints
- Marketing needs interpretable outputs for campaign design
- Batch scoring must complete in under 10 minutes daily
- The solution should use standard Python ML tooling and be easy to retrain monthly
Deliverables
- Train one supervised model for purchase prediction
- Train one unsupervised model for customer segmentation
- Compare inputs, outputs, evaluation, and business use cases of both approaches
- Describe feature preprocessing choices for mixed numeric/categorical data
- Recommend which method to use for conversion targeting vs. exploratory segmentation
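One possible shape for the first four deliverables is sketched below: a single preprocessing recipe for mixed numeric/categorical columns, reused by a supervised pipeline (features plus labels in, purchase probability out) and an unsupervised one (features only in, segment id out). The column names come from the feature table, but the toy rows, the `purchased` labels, the helper `make_preprocessor`, and the model choices (logistic regression, k-means with 2 clusters) are all hypothetical and chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made-up values); NaN marks missing demographics/engagement data.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 27, 45, np.nan, 38, 29],
    "region": ["north", "south", np.nan, "east", "north", "south", "east", np.nan],
    "orders_last_90d": [3, 0, 1, 5, 0, 2, 4, 0],
    "email_open_rate": [0.42, 0.10, np.nan, 0.65, 0.05, 0.30, 0.55, np.nan],
})
df["region"] = df["region"].astype(object)  # plain object dtype for the imputer
purchased = np.array([1, 0, 0, 1, 0, 0, 1, 0])  # hypothetical 30-day labels

numeric_cols = ["age", "orders_last_90d", "email_open_rate"]
categorical_cols = ["region"]


def make_preprocessor():
    """Hypothetical helper: impute, flag missingness, scale, and one-hot encode."""
    numeric = Pipeline([
        # Median imputation plus indicator columns so "was missing" stays visible.
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        # Treat missing categories as their own explicit level.
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Supervised: needs labels; output is a purchase probability per customer.
purchase_model = Pipeline([
    ("prep", make_preprocessor()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(df, purchased)
purchase_proba = purchase_model.predict_proba(df)[:, 1]

# Unsupervised: no labels; output is a segment id per customer.
segment_model = Pipeline([
    ("prep", make_preprocessor()),
    ("km", KMeans(n_clusters=2, n_init=10, random_state=0)),
]).fit(df)
segments = segment_model.predict(df)

print(pd.DataFrame({"purchase_proba": purchase_proba.round(2), "segment": segments}))
```

Wrapping preprocessing and model in one `Pipeline` keeps monthly retraining to a single `fit` call and ensures daily batch scoring applies exactly the transformations learned at training time. Scaling matters more for the k-means branch than for the classifier, since k-means relies on Euclidean distances and unscaled features with large ranges would dominate the segments.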