You own a binary classifier that ranks potentially abusive marketplace listings before moderation on a consumer platform. The model is a fine-tuned gradient-boosted tree ensemble, trained offline and deployed in Sony's MLOps stack. Listings scoring above 0.60 are sent to manual review, and the rest are auto-approved. Offline validation looked strong, but six weeks after launch the trust-and-safety team reports a rise in user complaints and moderator escalations, even though moderators still consider the quality of the review queue acceptable. You are asked to explain why the model performs well on training and validation data but underperforms in production.

| Metric | Offline Validation | Production (last 14 days) |
|---|---|---|
| Accuracy | 0.94 | 0.91 |
| Precision | 0.88 | 0.86 |
| Recall | 0.81 | 0.57 |
| F1 Score | 0.84 | 0.68 |
| AUC-ROC | 0.93 | 0.84 |
| Log Loss | 0.19 | 0.34 |
| Positive prediction rate | 9.8% | 6.1% |
| Recall (true positive rate) among new sellers | 0.79 | 0.42 |
| Abuse prevalence | 11.2% | 14.9% |
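
As a quick consistency check on the table, the reported metrics can be related to one another arithmetically. The sketch below recomputes F1 from precision and recall, and derives the positive prediction rate implied by precision, recall, and prevalence (since true positives = recall × prevalence = precision × positive prediction rate). The function and variable names are illustrative, and the check assumes that every metric in a column was computed on the same labelled sample, which the scenario does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_positive_rate(precision: float, recall: float, prevalence: float) -> float:
    """Positive prediction rate implied by the identity
    recall * prevalence = precision * positive_rate (both equal the TP share)."""
    return recall * prevalence / precision


# Values copied from the metrics table.
columns = {
    "offline": {"precision": 0.88, "recall": 0.81, "prevalence": 0.112, "reported_rate": 0.098},
    "production": {"precision": 0.86, "recall": 0.57, "prevalence": 0.149, "reported_rate": 0.061},
}

for name, m in columns.items():
    f1 = f1_score(m["precision"], m["recall"])
    implied = implied_positive_rate(m["precision"], m["recall"], m["prevalence"])
    print(
        f"{name}: F1={f1:.3f}, "
        f"implied positive rate={implied:.3f}, reported={m['reported_rate']:.3f}"
    )
```

A large gap between the implied and reported positive prediction rate in a column would suggest that the metrics were measured on different samples (for example, precision on reviewed listings only and recall on a separately labelled audit set), which is worth clarifying before interpreting the table.
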

How would you diagnose the gap between offline and production performance, and what changes would you recommend to restore production quality without overwhelming the moderation team?
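
The capacity constraint in the question can be framed as picking the lowest review threshold whose queue still fits a moderation budget. The following is a minimal sketch of that framing, not a recommended policy: `threshold_for_budget`, the `max_review_fraction` capacity figure, and the sample scores are all hypothetical and not given in the scenario.

```python
def threshold_for_budget(scores, max_review_fraction):
    """Return the lowest threshold such that the share of listings scoring
    strictly above it does not exceed max_review_fraction."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 <= max_review_fraction <= 1:
        raise ValueError("max_review_fraction must be in [0, 1]")

    ranked = sorted(scores, reverse=True)
    budget = int(len(ranked) * max_review_fraction)  # listings moderators can review
    if budget >= len(ranked):
        return float("-inf")  # capacity covers every listing
    # Everything strictly above ranked[budget] occupies at most `budget` slots
    # (fewer if there are ties at the cut-off score).
    return ranked[budget]


# Hypothetical recent production scores and a 25% review capacity.
recent_scores = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30, 0.20, 0.10]
threshold = threshold_for_budget(recent_scores, max_review_fraction=0.25)
queued = [s for s in recent_scores if s > threshold]
print(f"threshold={threshold}, queued={len(queued)} of {len(recent_scores)}")
```

Comparing a capacity-derived threshold like this with the current 0.60 cut-off shows how much room there is to trade precision for recall before the queue exceeds what moderators can handle.
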