You own a gradient-boosted binary classifier that scores whether a newly posted message on a social platform should be sent to human moderation. Messages with score above 0.70 are auto-queued for review, while lower-scored messages remain visible unless later user-reported. The model looked strong offline before launch, but over the last six weeks Trust & Safety has seen moderator escalations rise and user reports for harmful content increase even though the review queue volume has barely changed. You are asked to explain why the model appears to be failing in production and what you would do next.
| Metric | Offline Validation | Current Production |
|---|---|---|
| Precision @ 0.70 | 0.91 | 0.89 |
| Recall @ 0.70 | 0.84 | 0.58 |
| F1 Score | 0.87 | 0.70 |
| AUC-ROC | 0.95 | 0.86 |
| Log Loss | 0.19 | 0.37 |
| Expected Calibration Error | 0.03 | 0.14 |
| % messages sent to review | 1.8% | 1.7% |
| Harmful messages found via user reports | 420/day | 1,180/day |
How would you diagnose the production failure from these metrics, and what changes would you recommend to the evaluation approach, thresholding, and model iteration plan?