You own a gradient-boosted binary classifier that prioritizes satellite imagery tiles for manual review in a defense monitoring workflow. Tiles scoring above a 0.60 threshold are escalated to analysts, and the model was approved after strong offline results on a held-out test set. Two weeks after deployment, stakeholders ask whether the model is overfitting because analyst feedback suggests the alerts are less reliable than expected even though offline metrics looked excellent. You need to assess whether the gap is due to overfitting versus threshold choice or data shift.
| Metric | Training | Validation | Holdout Test | Production Week 2 |
|---|---|---|---|---|
| Accuracy | 0.98 | 0.89 | 0.88 | 0.84 |
| Precision | 0.97 | 0.81 | 0.79 | 0.71 |
| Recall | 0.96 | 0.76 | 0.74 | 0.69 |
| F1 Score | 0.97 | 0.78 | 0.76 | 0.70 |
| AUC-ROC | 0.99 | 0.86 | 0.85 | 0.80 |
| Positive prediction rate | 18% | 14% | 13% | 11% |
How would you determine whether this model is truly overfitting, and what validation approach or model changes would you recommend before expanding deployment?