Business Context
Northstar Retail, an online marketplace with 1.2M customers, wants to understand customer behavior better and improve its targeting of repeat purchases. The analytics team needs two things: customer segments to inform marketing strategy, and a model that predicts whether a customer will make a purchase in the next 30 days.
Dataset
You are given a customer-level dataset built from 12 months of transaction and engagement history.
| Feature Group | Count | Examples |
|---|---|---|
| Demographics | 5 | age_bucket, region, acquisition_channel, device_type, loyalty_tier |
| Transaction history | 10 | total_orders, avg_order_value, days_since_last_order, refund_rate |
| Engagement | 8 | email_open_rate, app_sessions_30d, product_views_30d, cart_add_rate |
| Support & returns | 4 | support_tickets_90d, return_count_90d, avg_resolution_time |
| Target | 1 | purchased_next_30d |
- Size: 240K customers, 27 input features
- Target: Binary label indicating whether the customer makes at least one purchase in the next 30 days
- Class balance: 28% positive, 72% negative
- Missing data: 12% missing in engagement fields for email-unsubscribed users, 6% missing in demographics
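The missing-data pattern above suggests imputing values while keeping a signal that the value was missing, since missing engagement fields are tied to email-unsubscribed users. The following is a minimal sketch of one way to do that, not a prescribed approach. The data is synthetic, only a handful of the listed columns are used, and the choice of median imputation plus an "unknown" category is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-in for the real table; all values are made up for illustration.
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "loyalty_tier": rng.choice(["bronze", "silver", "gold"], size=n),
    "total_orders": rng.poisson(4, size=n).astype(float),
    "email_open_rate": rng.uniform(0, 1, size=n),
    "app_sessions_30d": rng.poisson(6, size=n).astype(float),
})

# Inject missingness at roughly the stated rates (12% engagement, 6% demographics).
df.loc[rng.random(n) < 0.12, ["email_open_rate", "app_sessions_30d"]] = np.nan
df.loc[rng.random(n) < 0.06, "region"] = np.nan

numeric_cols = ["total_orders", "email_open_rate", "app_sessions_30d"]
categorical_cols = ["region", "loyalty_tier"]

preprocess = ColumnTransformer([
    # Median-impute numeric fields and append a missing-value flag for each
    # column that had gaps at fit time; the flags are standardized with the rest.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Treat missing categories as their own "unknown" level.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)
```

Keeping the imputation inside the pipeline means the same fitted medians and categories are reused at scoring time, which avoids leaking test-set statistics into training.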
Success Criteria
A strong solution should:
- Build meaningful customer segments using an unsupervised method
- Train a supervised model that achieves ROC-AUC of at least 0.82 on a held-out test set
- Explain the practical difference between supervised and unsupervised learning using this dataset and the resulting outputs
Constraints
- Marketing needs interpretable segments for campaign design
- Scoring must run daily on 240K customers in under 10 minutes
- The solution should use standard Python ML tooling and be maintainable by a small data team
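The daily scoring budget can be sanity-checked early by timing inference on an array of the stated size. The sketch below assumes a logistic regression model and random synthetic features; it measures model inference only, so feature building and data loading would need to be timed separately in a real pipeline.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

N_CUSTOMERS, N_FEATURES = 240_000, 27
BUDGET_SECONDS = 10 * 60  # "under 10 minutes" from the constraints

rng = np.random.default_rng(0)

# Fit on a small synthetic sample; the label rule here is made up.
X_fit = rng.normal(size=(5_000, N_FEATURES))
y_fit = (X_fit[:, 0] + rng.normal(size=5_000) > 0.6).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Simulate one daily batch of the full customer base.
X_daily = rng.normal(size=(N_CUSTOMERS, N_FEATURES))

start = time.perf_counter()
scores = model.predict_proba(X_daily)[:, 1]
elapsed = time.perf_counter() - start

print(f"Scored {len(scores):,} customers in {elapsed:.2f}s "
      f"(budget: {BUDGET_SECONDS}s)")
```

Running the same timing with each candidate model makes it easy to see how much of the budget is left for preprocessing and I/O.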
Deliverables
- Build an unsupervised learning pipeline to cluster customers and describe each segment
- Build a supervised learning pipeline to predict purchased_next_30d
- Compare the objectives, inputs, outputs, and evaluation of both approaches
- Recommend how both models would be used together in production
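The deliverables above can be illustrated side by side in one small sketch: clustering ignores the label and is judged by cluster quality, while the classifier is trained on the label, and the two outputs can be joined per customer. Everything here is an assumption for illustration: the data and label rule are synthetic, and K-means with four clusters and logistic regression are stand-ins rather than recommended choices. Propensities are computed in-sample for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 3_000

# Synthetic customers using a few of the listed feature names.
df = pd.DataFrame({
    "total_orders": rng.poisson(4, n),
    "avg_order_value": rng.gamma(2.0, 30.0, n),
    "days_since_last_order": rng.integers(1, 365, n),
    "app_sessions_30d": rng.poisson(6, n),
})

# Made-up label rule: recent, active customers purchase more often.
logit = 0.3 * df["app_sessions_30d"] - 0.01 * df["days_since_last_order"] - 1.5
df["purchased_next_30d"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

features = ["total_orders", "avg_order_value",
            "days_since_last_order", "app_sessions_30d"]
X = StandardScaler().fit_transform(df[features])

# Unsupervised: the target is never used; output is a segment id per customer.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["segment"] = kmeans.fit_predict(X)
silhouette = silhouette_score(X, df["segment"])  # label-free quality measure

# Supervised: the target is used; output is a purchase probability per customer.
clf = LogisticRegression(max_iter=1000).fit(X, df["purchased_next_30d"])
df["propensity"] = clf.predict_proba(X)[:, 1]  # in-sample, for brevity only

# Combined view: segment profiles in original units plus mean propensity.
summary = df.groupby("segment").agg(
    customers=("propensity", "size"),
    avg_days_since_last_order=("days_since_last_order", "mean"),
    avg_sessions_30d=("app_sessions_30d", "mean"),
    mean_propensity=("propensity", "mean"),
).round(2)

print(f"Silhouette score: {silhouette:.3f}")
print(summary)
```

A table like `summary` is one way to hand both outputs to marketing: the segment describes who the customers are, and the propensity indicates which of them to prioritize within each segment.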