You own a binary churn prediction model for a subscription product, deployed as an Azure Machine Learning online endpoint. The model is a gradient-boosted tree classifier that scores active users weekly, and accounts scoring above a 0.60 threshold are sent to a retention campaign with a limited budget. Leadership is concerned: the dashboard still shows high accuracy, but the retention team reports that too many actual churners are not being targeted. You are asked whether the model is truly performing well, given that only a small fraction of users churn each week.

| Metric | Validation Set | Last 4 Weeks in Production |
|---|---|---|
| Positive class rate (churn) | 3.2% | 3.5% |
| Accuracy | 97.1% | 96.8% |
| Precision | 0.41 | 0.39 |
| Recall | 0.62 | 0.28 |
| F1 Score | 0.49 | 0.33 |
| AUC-ROC | 0.89 | 0.87 |
| Users flagged per week | 5,400 | 2,600 |
| Actual churners per week | 3,200 | 3,500 |
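
As a quick sanity check on the table, the sketch below compares the reported accuracy with the accuracy of a trivial rule that never predicts churn, recomputes F1 from precision and recall, and estimates how many churners are missed each week. The `table` dictionary and the `summarize` helper are hypothetical names introduced here for illustration. The inputs are the rounded figures from the table, so the derived counts are approximate.

```python
# Rounded figures copied from the metrics table; the dictionary layout is
# an illustrative assumption, not part of any existing pipeline.
table = {
    "validation": {"churn_rate": 0.032, "accuracy": 0.971,
                   "precision": 0.41, "recall": 0.62, "churners": 3200},
    "production": {"churn_rate": 0.035, "accuracy": 0.968,
                   "precision": 0.39, "recall": 0.28, "churners": 3500},
}


def summarize(m):
    """Derive imbalance-aware quantities from one column of the table."""
    # Accuracy of a rule that labels every user "no churn".
    baseline_accuracy = 1.0 - m["churn_rate"]
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
    # Recall is the share of actual churners that were flagged.
    caught = m["recall"] * m["churners"]
    return {
        "baseline_accuracy": baseline_accuracy,
        "accuracy_gain": m["accuracy"] - baseline_accuracy,
        "f1": f1,
        "churners_caught": round(caught),
        "churners_missed": round(m["churners"] - caught),
    }


summary = {name: summarize(m) for name, m in table.items()}

for name, s in summary.items():
    print(
        f"{name}: baseline accuracy {s['baseline_accuracy']:.3f}, "
        f"gain over baseline {s['accuracy_gain']:+.3f}, "
        f"F1 {s['f1']:.2f}, "
        f"missed churners per week {s['churners_missed']}"
    )
```

With these inputs, the never-predict-churn rule would score about 96.5% accuracy in production against the reported 96.8%, which suggests that accuracy carries little signal at this class balance. A recall of 0.28 implies that roughly 2,520 of the 3,500 weekly churners are not flagged.
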

How would you evaluate this model in the presence of class imbalance, and what would you recommend changing in the evaluation approach or decision threshold before deciding whether to keep this model in production?