You own a gradient-boosted binary classifier that predicts whether a digital grocery order will include a substitution event, so that operations can pre-position inventory and labor. The model is scored nightly, and orders scoring above the 0.60 threshold trigger proactive replenishment actions. After a recent retrain with more features and deeper trees, the offline training report looked excellent, but the holdout and recent production results are noticeably weaker, and operations leaders are asking whether the new model is overfitting.

| Metric | Training | Validation | Recent Production |
|---|---|---|---|
| Accuracy | 0.94 | 0.81 | 0.79 |
| Precision | 0.91 | 0.74 | 0.72 |
| Recall | 0.89 | 0.63 | 0.60 |
| F1 Score | 0.90 | 0.68 | 0.65 |
| AUC-ROC | 0.97 | 0.78 | 0.76 |
| Log Loss | 0.18 | 0.49 | 0.53 |
| Positive prediction rate | 28% | 19% | 18% |

How would you evaluate whether this model is overfitting, and what would you recommend before keeping it in production?
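
As a possible starting point for that evaluation, the sketch below computes the training-to-validation and validation-to-production gaps for each metric in the table, and checks that the reported F1 scores agree with the reported precision and recall. The names `METRICS`, `gaps`, and `f1` are hypothetical and exist only for illustration; the values are copied from the table. A large first gap alongside a small second gap would be consistent with overfitting rather than production drift, but the table alone does not prove it.

```python
# Metrics copied from the table: (training, validation, recent production).
# The dictionary and function names are illustrative assumptions.
METRICS = {
    "Accuracy": (0.94, 0.81, 0.79),
    "Precision": (0.91, 0.74, 0.72),
    "Recall": (0.89, 0.63, 0.60),
    "F1 Score": (0.90, 0.68, 0.65),
    "AUC-ROC": (0.97, 0.78, 0.76),
    "Log Loss": (0.18, 0.49, 0.53),
}


def gaps(metrics):
    """Return {metric: (train - validation, validation - production)}."""
    return {
        name: (round(train - val, 2), round(val - prod, 2))
        for name, (train, val, prod) in metrics.items()
    }


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


GAPS = gaps(METRICS)

if __name__ == "__main__":
    for name, (train_val, val_prod) in GAPS.items():
        print(f"{name:<10} train-val={train_val:+.2f}  val-prod={val_prod:+.2f}")

    # Sanity check: recompute F1 for each split from its precision and recall.
    for i, split in enumerate(("training", "validation", "production")):
        recomputed = f1(METRICS["Precision"][i], METRICS["Recall"][i])
        print(f"F1 ({split}): reported={METRICS['F1 Score'][i]:.2f} "
              f"recomputed={recomputed:.2f}")
```

Log loss is lower-is-better, so its gaps come out negative where the other metrics come out positive. The gaps only describe the symptom; confirming the cause would still need checks the table cannot supply, such as per-iteration validation curves or a comparison against the previous, shallower model.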