Business Context
ShopSphere, a mid-sized e-commerce marketplace with 1.2M registered users, wants to improve retention and marketing efficiency. The data team needs both a model to predict which customers will churn and a segmentation approach for customers who have no labeled outcomes.
Dataset
You are given a customer-level dataset extracted from 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Behavioral metrics | 12 | sessions_30d, avg_session_duration, add_to_cart_rate, days_since_last_visit |
| Transactional metrics | 9 | orders_90d, avg_order_value, discount_usage_rate, refund_rate |
| Customer profile | 6 | country, device_type, acquisition_channel, loyalty_tier |
| Engagement metrics | 5 | email_open_rate, push_click_rate, support_tickets_90d |
| Label | 1 | churned_60d |
- Size: 240K customers, 32 input features
- Target availability: 180K rows include a churn label; 60K rows are unlabeled due to recent acquisition or tracking gaps
- Class balance: 18% churned, 82% retained in the labeled subset
- Missing data: ~8% missing in engagement fields, ~3% missing in profile fields
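As a quick sanity check, the figures above can be reconciled with a few lines of arithmetic. The sketch below only restates numbers from this section; the "predict retained for everyone" baseline is an illustrative assumption added here to show why accuracy alone would be a misleading metric at an 18% churn rate.

```python
# Figures copied from the dataset description; nothing here is measured.
feature_counts = {
    "behavioral": 12,
    "transactional": 9,
    "profile": 6,
    "engagement": 5,
}
labeled_rows = 180_000
unlabeled_rows = 60_000
churn_rate = 0.18

total_features = sum(feature_counts.values())  # should match "32 input features"
total_rows = labeled_rows + unlabeled_rows      # should match "240K customers"
churned = round(labeled_rows * churn_rate)      # churners in the labeled subset
retained = labeled_rows - churned

# Hypothetical baseline: predict "retained" for every labeled customer.
baseline_accuracy = retained / labeled_rows     # high, despite finding no churners
baseline_true_positives = 0
baseline_recall = baseline_true_positives / churned  # 0.0, so F1 is 0 as well

print(total_features, total_rows, churned, baseline_accuracy, baseline_recall)
```

The baseline reaches 82% accuracy while catching no churners, which is one reason the success criteria are stated in terms of ROC-AUC and F1.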
Success Criteria
A strong solution should:
- Build a supervised model that predicts 60-day churn with ROC-AUC >= 0.82 and F1 >= 0.55 on a held-out test set
- Produce an unsupervised segmentation with clusters that are stable, interpretable, and useful for marketing
- Clearly explain the difference between supervised and unsupervised learning using this dataset
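To make the two supervised metrics concrete, here is a minimal from-scratch sketch of F1 and ROC-AUC. The toy labels, scores, and the 0.5 threshold are made up for illustration; the pairwise AUC is quadratic in the number of rows, so on the 180K labeled customers a library routine (for example scikit-learn's `roc_auc_score` and `f1_score`) would be the more practical choice.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random churner outscores a random non-churner.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up toy data: 3 churners, 5 retained customers.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.05]

threshold = 0.5  # assumed cut-off; F1 depends on it, ROC-AUC does not
y_pred = [1 if s >= threshold else 0 for s in scores]

auc = roc_auc(y_true, scores)  # 13 of 15 churner/non-churner pairs ranked correctly
f1 = f1_score(y_true, y_pred)  # precision and recall are both 2/3 here
print(round(auc, 3), round(f1, 3))
```

Because F1 depends on the decision threshold and ROC-AUC does not, the two targets can be tuned somewhat independently: ranking quality first, then the cut-off.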
Constraints
- Predictions must run in a nightly batch job for 1.2M users in under 20 minutes
- Marketing needs interpretable outputs, not a black-box-only solution
- The same preprocessing logic must work for both labeled and unlabeled customers
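One way to read the last constraint is that preprocessing should depend on feature columns only, so it can be applied to rows with or without `churned_60d`. The sketch below is a stdlib-only illustration under that assumption: the `SharedPreprocessor` class, the median/mode imputation strategy, and the toy rows are all hypothetical, not a prescribed design. It also works out the throughput implied by the batch constraint.

```python
from collections import Counter
from statistics import median

LABEL = "churned_60d"  # target column; present only on labeled rows


class SharedPreprocessor:
    """Learns fill values from feature columns only, so it never needs the label."""

    def __init__(self, numeric_cols, categorical_cols):
        self.numeric_cols = list(numeric_cols)
        self.categorical_cols = list(categorical_cols)
        self.fill_values = {}

    def fit(self, rows):
        # Median for numeric fields, most frequent value for categorical fields.
        for col in self.numeric_cols:
            observed = [r[col] for r in rows if r.get(col) is not None]
            self.fill_values[col] = median(observed)
        for col in self.categorical_cols:
            observed = [r[col] for r in rows if r.get(col) is not None]
            self.fill_values[col] = Counter(observed).most_common(1)[0][0]
        return self

    def transform(self, rows):
        # Keep feature columns only; the label (if any) is dropped here.
        cols = self.numeric_cols + self.categorical_cols
        return [
            {c: r[c] if r.get(c) is not None else self.fill_values[c] for c in cols}
            for r in rows
        ]


# Made-up rows for illustration.
labeled = [
    {"email_open_rate": 0.4, "country": "DE", LABEL: 0},
    {"email_open_rate": None, "country": "DE", LABEL: 1},
    {"email_open_rate": 0.2, "country": None, LABEL: 0},
]
unlabeled = [{"email_open_rate": None, "country": "FR"}]

# Fit on the labeled rows (standing in for a training split), apply to both sets.
prep = SharedPreprocessor(["email_open_rate"], ["country"]).fit(labeled)
X_labeled = prep.transform(labeled)
X_unlabeled = prep.transform(unlabeled)

# Throughput implied by the batch constraint: 1.2M users in 20 minutes.
required_rows_per_second = 1_200_000 / (20 * 60)
print(X_unlabeled, required_rows_per_second)
```

Fitting the fill values on a training split and reusing them everywhere keeps the supervised and unsupervised pipelines consistent and avoids leaking test-set statistics into the model.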
Deliverables
- Train a supervised model on labeled customers to predict churn
- Build an unsupervised clustering pipeline on the full customer base
- Compare the two approaches, including when each should be used
- Evaluate both approaches with appropriate metrics and interpretation
- Recommend how to deploy both outputs in production
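For the clustering deliverables, "stable" needs a measurable definition. One possible choice, sketched below as an assumption rather than a required method, is to rerun the clustering (for example on different seeds or bootstrap samples) and compare the resulting assignments with the Rand index, which is unaffected by how cluster IDs happen to be numbered. The segment IDs are made up.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of customer pairs on which two clusterings agree.

    A pair agrees when both runs put the two customers in the same cluster,
    or both runs put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Made-up segment IDs for six customers from three clustering runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 0]  # IDs permuted, and the last customer moved segments
run_3 = [2, 2, 0, 0, 1, 1]  # pure relabeling of run_1

stability_12 = rand_index(run_1, run_2)
stability_13 = rand_index(run_1, run_3)
print(stability_12, stability_13)
```

This pairwise version is quadratic in the number of customers, so on the full 240K base it would be computed on a sample. The adjusted Rand index, which corrects for chance agreement, is a common refinement.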