You own a gradient-boosted tree model that predicts 30-day churn for a subscription product and triggers retention offers for users with scores above 0.60. The model was retrained last week after dozens of behavioral features from recent product usage logs were added, and the training team is excited because offline performance on the training set improved sharply. However, when you evaluated the same model on validation data and an untouched holdout set, performance dropped on every quality metric reported below, and finance is worried that the retention budget is being spent on the wrong users. You need to assess whether the model is overfitting and decide what to do before rollout.

| Metric | Training | Validation | Holdout Test |
|---|---|---|---|
| Accuracy | 0.94 | 0.81 | 0.79 |
| Precision | 0.91 | 0.68 | 0.65 |
| Recall | 0.88 | 0.59 | 0.56 |
| F1 Score | 0.89 | 0.63 | 0.60 |
| AUC-ROC | 0.97 | 0.76 | 0.74 |
| Log Loss | 0.18 | 0.49 | 0.53 |
| Positive prediction rate | 14% | 11% | 10% |
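
As a starting point for reading the table, the train-to-validation and train-to-holdout gaps can be computed directly from the reported numbers. The sketch below is only illustrative: the names `metrics`, `generalization_gaps`, and `f1_from` are hypothetical helpers, not part of any existing pipeline, and the values are copied from the table rather than recomputed from model predictions.

```python
# Metric values copied from the table: (training, validation, holdout test).
metrics = {
    "Accuracy": (0.94, 0.81, 0.79),
    "Precision": (0.91, 0.68, 0.65),
    "Recall": (0.88, 0.59, 0.56),
    "F1 Score": (0.89, 0.63, 0.60),
    "AUC-ROC": (0.97, 0.76, 0.74),
    "Log Loss": (0.18, 0.49, 0.53),
}


def generalization_gaps(table):
    """Return {metric: (train - validation, train - holdout)}, rounded to 2 places."""
    return {
        name: (round(train - val, 2), round(train - holdout, 2))
        for name, (train, val, holdout) in table.items()
    }


def f1_from(precision, recall):
    """Harmonic mean of precision and recall, used to sanity-check the F1 row."""
    return 2 * precision * recall / (precision + recall)


gaps = generalization_gaps(metrics)
for name, (val_gap, holdout_gap) in gaps.items():
    # For Log Loss, lower is better, so a negative gap means training looks better.
    print(f"{name:<10} train-val: {val_gap:+.2f}  train-holdout: {holdout_gap:+.2f}")
```

A large gap between training and both held-out columns, combined with a small gap between validation and holdout, is the pattern usually associated with overfitting rather than with a problem in one particular evaluation split.
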

How would you evaluate whether this model is overfitting, and what would you recommend before deploying it to drive retention actions?