You own a gradient-boosted churn prediction model for a subscription platform. The model scores active accounts weekly, and accounts scoring above a 0.40 threshold are sent to the retention team for outreach. A recent analysis showed strong offline performance, but leadership is questioning whether it is trustworthy: renewal outcomes in production have been weaker than expected, especially for high-value accounts. You need to explain how you would confirm that the findings are reliable before recommending any model or threshold changes.

| Metric | Offline Validation | Last 4 Weeks in Production |
|---|---|---|
| Accuracy | 0.84 | 0.83 |
| Precision | 0.61 | 0.58 |
| Recall | 0.74 | 0.49 |
| F1 Score | 0.67 | 0.53 |
| AUC-ROC | 0.87 | 0.79 |
| Log Loss | 0.41 | 0.56 |
| Predicted churn rate | 18.5% | 19.1% |
| Actual churn rate | 17.9% | 18.7% |
| High-value segment recall | 0.71 | 0.38 |
| Weekly accounts flagged | 12,400 | 11,900 |

How would you determine whether your findings are reliable, diagnose the gap between validation and production performance, and decide what evidence is strong enough to support changes to the model or decision threshold?
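
One possible first step, before any deeper diagnosis, is to check that the reported numbers are internally consistent and to see which metrics moved most between offline validation and production. The sketch below uses only values copied from the table. The helper names (`f1_from`, `relative_change`, `implied_flag_rate`) are illustrative, and the flag-rate check assumes that precision, recall, and actual churn rate were all computed on the same scored population at the 0.40 threshold, which the scenario does not state.

```python
# Values copied from the metrics table as (offline, production) pairs.
metrics = {
    "accuracy": (0.84, 0.83),
    "precision": (0.61, 0.58),
    "recall": (0.74, 0.49),
    "f1": (0.67, 0.53),
    "auc_roc": (0.87, 0.79),
    "log_loss": (0.41, 0.56),
    "high_value_recall": (0.71, 0.38),
}
actual_churn_rate = (0.179, 0.187)
environments = ("offline", "production")


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_change(offline, production):
    """Signed change from offline to production, as a fraction of offline."""
    return (production - offline) / offline


def implied_flag_rate(precision, recall, prevalence):
    """Share of accounts flagged, derived from TP = recall * positives
    and flagged = TP / precision. Assumes one shared population."""
    return recall * prevalence / precision


# 1. Does the reported F1 follow from the reported precision and recall?
f1_check = {
    env: round(f1_from(metrics["precision"][i], metrics["recall"][i]), 2)
    for i, env in enumerate(environments)
}

# 2. Which metrics moved most, in relative terms?
ranked = sorted(
    metrics,
    key=lambda name: abs(relative_change(*metrics[name])),
    reverse=True,
)

# 3. What share of accounts do precision, recall, and prevalence imply
#    were flagged in each environment?
flag_rates = {
    env: round(
        implied_flag_rate(
            metrics["precision"][i], metrics["recall"][i], actual_churn_rate[i]
        ),
        3,
    )
    for i, env in enumerate(environments)
}

print(f1_check)
print(ranked)
print(flag_rates)
```

On these inputs the recomputed F1 values match the table, and high-value segment recall shows the largest relative shift while accuracy shows the smallest, which is what one would expect when roughly 81% of accounts do not churn. If the implied flag rates differ noticeably from the reported predicted churn rate or from the trend in weekly accounts flagged, that could mean the rows were computed on different populations or time windows, or that "predicted churn rate" is not the share of accounts flagged. Either way, the metric definitions are worth confirming before the gap is attributed to the model itself.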