Business Context
StreamCart, a mid-sized subscription video platform with 2.4M monthly active users, wants to improve retention. The analytics team needs both unsupervised learning to discover natural customer segments and supervised learning to predict which users are likely to churn in the next 30 days.
Dataset
You are given a user-level dataset built from the last 12 months of activity.
| Feature Group | Count | Examples |
|---|---|---|
| Engagement | 10 | weekly_watch_hours, sessions_per_week, completion_rate |
| Subscription | 6 | plan_type, tenure_days, monthly_price, auto_renew |
| Device & Region | 5 | primary_device, country, app_version |
| Support & Billing | 5 | support_tickets_90d, payment_failures_90d |
| Derived behavior | 6 | days_since_last_watch, weekend_ratio, genre_diversity |
- Size: 120K users, 32 features
- Target for supervised task: churn_30d (1 if the user cancels within 30 days, else 0)
- Unsupervised task: no target label; identify meaningful user segments
- Class balance: 14% churn, 86% retained
- Missing data: ~8% missing in support and billing fields, ~3% missing in device metadata
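
The class balance sets the floor that any churn classifier has to beat. The following is a quick sanity check, derived only from the 120K dataset size and the 14% churn rate stated above; the helper name `f1` and the two baselines are illustrative assumptions, not part of the brief.

```python
# Trivial baselines implied by the stated class balance (120K users, 14% churn).
N_USERS = 120_000
CHURN_RATE = 0.14

n_churn = round(N_USERS * CHURN_RATE)   # users with churn_30d == 1
n_retained = N_USERS - n_churn          # users with churn_30d == 0


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Baseline A: predict "retained" for every user.
acc_all_retained = n_retained / N_USERS
f1_all_retained = f1(tp=0, fp=0, fn=n_churn)

# Baseline B: predict "churn" for every user.
f1_all_churn = f1(tp=n_churn, fp=n_retained, fn=0)

print(n_churn, n_retained)
print(f"all retained: accuracy={acc_all_retained:.2f}, F1={f1_all_retained:.2f}")
print(f"all churn:    F1={f1_all_churn:.3f}")
```

Predicting "retained" for everyone reaches 86% accuracy with an F1 of 0, and predicting "churn" for everyone gives an F1 of roughly 0.25. This is why accuracy alone is a weak metric for this task.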
Success Criteria
A strong solution should:
- Build a churn classifier with ROC-AUC >= 0.84 and F1 >= 0.55 on the holdout set
- Produce 3-6 interpretable user segments with clear behavioral differences
- Clearly explain when supervised learning is the appropriate choice and when unsupervised learning is
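
The two classifier thresholds above can be checked with a few lines of code. This is a minimal, dependency-free sketch: the function names, the 0.5 decision threshold, and the eight-row sample are all assumptions made for illustration, and in practice a library implementation would normally be used on the real holdout set.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC via the rank-sum (Mann-Whitney U) form, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes present")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float],
                    threshold: float) -> float:
    """F1 for the positive (churn) class when scores >= threshold are flagged."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def meets_criteria(y_true, scores, threshold=0.5,
                   min_auc=0.84, min_f1=0.55) -> bool:
    """True if both targets from the success criteria are met."""
    return (roc_auc(y_true, scores) >= min_auc
            and f1_at_threshold(y_true, scores, threshold) >= min_f1)


# Tiny made-up holdout sample, for illustration only.
y_holdout = [0, 0, 1, 0, 1, 1, 0, 1]
p_churn = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]
print(roc_auc(y_holdout, p_churn), f1_at_threshold(y_holdout, p_churn, 0.5))
print(meets_criteria(y_holdout, p_churn))
```

ROC-AUC does not depend on a decision threshold, while F1 does. The threshold used to report F1 should therefore be chosen on validation data rather than on the holdout set.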
Constraints
- Predictions are generated in a nightly batch job for 120K users
- Marketing needs segment definitions simple enough to act on
- The retention team requires feature importance for churn predictions
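
One model-agnostic way to meet the feature-importance requirement is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below is only an illustration. The rule-based `toy_predict` model, the eight sample rows, and the use of accuracy as the score are assumptions chosen to keep the example short; on the real, imbalanced data a metric such as ROC-AUC would be the more sensible score.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict: Callable[[Row], int], X: List[Row],
                           y: Sequence[int], n_repeats: int = 5,
                           seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            permuted = accuracy(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical columns: [days_since_last_watch, weekend_ratio].
X = [[2, 0.5], [30, 0.4], [5, 0.9], [21, 0.1],
     [1, 0.3], [45, 0.7], [10, 0.6], [18, 0.2]]


def toy_predict(row: Row) -> int:
    # Hypothetical stand-in model: flag users inactive for more than 14 days.
    return int(row[0] > 14)


y = [toy_predict(row) for row in X]  # labels follow the toy rule exactly
importances = permutation_importance(toy_predict, X, y)
print(dict(zip(["days_since_last_watch", "weekend_ratio"], importances)))
```

Because the toy model ignores weekend_ratio, shuffling that column never changes a prediction and its importance is exactly zero. Any drop in score comes from shuffling days_since_last_watch.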
Deliverables
- Train one supervised model to predict churn_30d
- Train one unsupervised model to segment users
- Compare the goals, inputs, outputs, and evaluation of both approaches
- Describe preprocessing and feature engineering choices
- Report metrics and recommend how both models would be used together in production