Context
StreamBox uses a gradient-boosted classifier to predict which subscribers are likely to cancel within 30 days so the retention team can send them targeted offers. The team suspects the model suffers from high variance, because offline training performance is much stronger than validation and test performance.
Current Performance
| Metric | Training | Validation | Test |
|---|---|---|---|
| Accuracy | 0.94 | 0.81 | 0.80 |
| Precision | 0.91 | 0.68 | 0.66 |
| Recall | 0.88 | 0.57 | 0.55 |
| F1 Score | 0.89 | 0.62 | 0.60 |
| AUC-ROC | 0.97 | 0.76 | 0.74 |
| Log Loss | 0.18 | 0.49 | 0.53 |
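The gap itself is easy to reproduce from held-out predictions. Below is a minimal sketch using scikit-learn, assuming a fitted model `clf` and feature/label arrays `X_train`, `y_train`, `X_val`, `y_val` (all hypothetical names, not from StreamBox's actual pipeline):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

def report(name, model, X, y):
    """Print the same metric panel as the table above for one data split."""
    proba = model.predict_proba(X)[:, 1]      # predicted churn probability
    pred = (proba >= 0.5).astype(int)         # default 0.5 decision threshold
    print(f"{name}: acc={accuracy_score(y, pred):.2f} "
          f"prec={precision_score(y, pred):.2f} "
          f"rec={recall_score(y, pred):.2f} "
          f"f1={f1_score(y, pred):.2f} "
          f"auc={roc_auc_score(y, proba):.2f} "
          f"logloss={log_loss(y, proba):.2f}")

report("train", clf, X_train, y_train)
report("val",   clf, X_val,   y_val)
```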
Additional details:
- Training set: 1.2M users, churn rate 14%
- Validation set: 150K users, churn rate 15%
- Test set: 150K users, churn rate 15%
- Model: 800 trees, max depth 12, min samples per leaf 2 (a lower-variance configuration sketch follows this list)
- Features: 220 behavioral, billing, and support features
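Depth-12 trees with a minimum of 2 samples per leaf give the ensemble enormous capacity, which on its own could explain the variance. One illustrative lower-variance configuration, sketched with scikit-learn's `HistGradientBoostingClassifier` rather than whatever library StreamBox actually uses; every hyperparameter value here is an assumption to be tuned, not a recommendation:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative variance-reducing settings: shallower trees, larger leaves,
# L2 shrinkage on leaf values, and early stopping on a held-out fraction.
clf = HistGradientBoostingClassifier(
    max_depth=5,              # down from 12
    min_samples_leaf=200,     # up from 2, so leaves must generalize
    l2_regularization=1.0,    # penalize large leaf values
    learning_rate=0.05,
    max_iter=2000,            # upper bound; early stopping picks fewer trees
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=50,
    random_state=0,
)
clf.fit(X_train, y_train)     # names carried over from the sketch above
```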
The Problem
The VP of Growth wants to know whether the issue is truly overfitting, how to confirm it systematically, and what changes should be made before the next model release.
Requirements
- Explain whether the metric pattern is consistent with high variance and why.
- Describe a step-by-step diagnostic plan to confirm the root cause (a learning-curve sketch follows this list).
- Identify which additional slices, plots, or validation checks you would run.
- Recommend specific model or data changes to reduce variance.
- Explain how you would measure whether the revised model is actually better.
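For the diagnostic plan, learning curves are the standard confirmation of high variance: if the validation score keeps improving as training size grows while a wide train/validation gap persists, the model is variance-limited. A sketch with scikit-learn's `learning_curve`, reusing the hypothetical names from above (on 1.2M rows this may need subsampling to fit the weekly pipeline):

```python
import numpy as np
from sklearn.model_selection import learning_curve

# If validation log loss keeps falling as training size grows while the
# train/validation gap stays wide, more data and/or regularization should
# help: the classic signature of a variance-limited model.
sizes, train_scores, val_scores = learning_curve(
    clf, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
    scoring="neg_log_loss",
    n_jobs=-1,
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:>9,d}  train_logloss={-tr:.3f}  val_logloss={-va:.3f}")
```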
Constraints
- The retention team can contact at most 25,000 users per week (see the precision-at-k sketch after this list).
- False positives incur discount costs and annoy customers.
- Retraining must fit within a weekly batch pipeline.
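Because only 25,000 users can be contacted per week, ranking quality at the top of the score distribution matters more than global, threshold-based metrics. A precision-at-k sketch, again reusing the hypothetical names from the earlier sketches:

```python
import numpy as np

def precision_at_k(y_true, scores, k=25_000):
    """Fraction of true churners among the k highest-scoring users --
    the metric that matches a fixed weekly contact budget."""
    top_k = np.argsort(scores)[::-1][:k]          # indices of top-k scores
    return float(np.mean(np.asarray(y_true)[top_k]))

val_proba = clf.predict_proba(X_val)[:, 1]
print(f"precision@25k = {precision_at_k(y_val, val_proba):.3f}")
```

Comparing the current and revised models on this metric, alongside calibration and expected discount cost, answers whether the new release is actually better under the contact budget rather than merely better on global accuracy.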