Debug Failing Content Safety Model

Scenario

You own a gradient-boosted binary classifier that scores whether a newly posted message on a social platform should be sent to human moderation. Messages with a score above 0.70 are auto-queued for review, while lower-scored messages remain visible unless users later report them. The model looked strong offline before launch, but over the last six weeks Trust & Safety has seen moderator escalations rise and user reports of harmful content increase, even though review queue volume has barely changed. You are asked to explain why the model appears to be failing in production and what you would do next.

Performance Data

Metric	Offline Validation	Current Production
Precision @ 0.70	0.91	0.89
Recall @ 0.70	0.84	0.58
F1 Score	0.87	0.70
AUC-ROC	0.95	0.86
Log Loss	0.19	0.37
Expected Calibration Error	0.03	0.14
% messages sent to review	1.8%	1.7%
Harmful messages found via user reports	420/day	1,180/day
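
A quick arithmetic check on the table can help before diagnosing anything. The sketch below recomputes F1 from the listed precision and recall, and derives the share of all messages that would have to be harmful for the listed precision, recall, and review rate to hold together (review rate × precision ÷ recall). It assumes all three figures are measured on the same message population, which the table does not state, so treat the implied prevalence as a rough consistency check rather than a measured value. The names `METRICS`, `implied_prevalence`, and `summarize` are illustrative, not part of the problem.

```python
# Figures copied from the Performance Data table (threshold 0.70).
METRICS = {
    "offline": {"precision": 0.91, "recall": 0.84, "review_rate": 0.018},
    "production": {"precision": 0.89, "recall": 0.58, "review_rate": 0.017},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_prevalence(precision: float, recall: float, review_rate: float) -> float:
    """Share of all messages that are harmful, implied by the three inputs.

    Assumption: true positives / all messages = review_rate * precision,
    and recall = true positives / all harmful messages.
    """
    return review_rate * precision / recall


def summarize(metrics: dict) -> dict:
    summary = {}
    for name, m in metrics.items():
        prevalence = implied_prevalence(m["precision"], m["recall"], m["review_rate"])
        summary[name] = {
            "f1_recomputed": f1_score(m["precision"], m["recall"]),
            "implied_prevalence": prevalence,
            # Harmful messages left below the threshold, as a share of all messages.
            "missed_share": prevalence * (1 - m["recall"]),
        }
    return summary


SUMMARY = summarize(METRICS)

if __name__ == "__main__":
    for name, row in SUMMARY.items():
        print(
            f"{name}: F1={row['f1_recomputed']:.2f}, "
            f"implied prevalence={row['implied_prevalence']:.2%}, "
            f"missed share={row['missed_share']:.2%}"
        )
```

Under those assumptions the recomputed F1 values match the table, and the implied harmful share of traffic rises while the review rate stays flat, which is the tension the question asks you to explain.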

Question

How would you diagnose the production failure from these metrics, and what changes would you recommend to the evaluation approach, thresholding, and model iteration plan?
