Debug Failing Content Safety Model

Difficulty: Hard
Topic: Model Evaluation
Tags: Calibration, Threshold Tuning, Diagnosis
Asked at: Discord

Problem

Scenario

You own a gradient-boosted binary classifier that scores whether a newly posted message on a social platform should be sent to human moderation. Messages with score above 0.70 are auto-queued for review, while lower-scored messages remain visible unless later user-reported. The model looked strong offline before launch, but over the last six weeks Trust & Safety has seen moderator escalations rise and user reports for harmful content increase even though the review queue volume has barely changed. You are asked to explain why the model appears to be failing in production and what you would do next.

Performance Data

| Metric | Offline Validation | Current Production |
| --- | --- | --- |
| Precision @ 0.70 | 0.91 | 0.89 |
| Recall @ 0.70 | 0.84 | 0.58 |
| F1 Score | 0.87 | 0.70 |
| AUC-ROC | 0.95 | 0.86 |
| Log Loss | 0.19 | 0.37 |
| Expected Calibration Error | 0.03 | 0.14 |
| % messages sent to review | 1.8% | 1.7% |
| Harmful messages found via user reports | 420/day | 1,180/day |
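
As a quick sanity check on the figures above, the sketch below recomputes F1 from the reported precision and recall, then backs out a rough harmful-content prevalence from the review rate. It is a back-of-envelope estimate, not part of the original problem. The function names are hypothetical, and it assumes the review rate, precision, and recall in each column were all measured on the same message population, which the problem does not state.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_prevalence(review_rate: float, precision: float, recall: float) -> float:
    """Rough share of all messages that are harmful.

    True positives per message = review_rate * precision.
    Dividing by recall scales that up to all harmful messages.
    Assumption: all three inputs describe the same population.
    """
    return review_rate * precision / recall


def missed_share(review_rate: float, precision: float, recall: float) -> float:
    """Share of all messages that are harmful AND fall below the threshold."""
    return implied_prevalence(review_rate, precision, recall) * (1 - recall)


# Values copied from the Performance Data table.
offline = {"review_rate": 0.018, "precision": 0.91, "recall": 0.84}
production = {"review_rate": 0.017, "precision": 0.89, "recall": 0.58}

for name, m in (("offline", offline), ("production", production)):
    print(
        f"{name}: F1={f1(m['precision'], m['recall']):.2f} "
        f"prevalence={implied_prevalence(**m):.4f} "
        f"missed={missed_share(**m):.4f}"
    )
```

Under those assumptions the reported F1 values are internally consistent, implied prevalence rises from roughly 1.95% to 2.61% of messages, and the share of messages that are harmful but never queued rises from roughly 0.31% to 1.10%. That direction matches the jump in user-reported harmful content despite a flat review queue, though production recall is itself only an estimate because unflagged messages are labeled mainly when users report them.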

Question

How would you diagnose the production failure from these metrics, and what changes would you recommend to the evaluation approach, thresholding, and model iteration plan?
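
One possible starting point for the evaluation and thresholding parts of the question is sketched below. It is not a model answer. It assumes a freshly labeled, randomly sampled set of production messages is available, including messages that scored below 0.70. The function names are hypothetical. The calibration function uses equal-width bins, which is one common way to compute Expected Calibration Error, and the problem does not say how its ECE figure was computed.

```python
import math
from typing import Sequence


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE for a binary classifier.

    Within each bin, compares the mean predicted score with the
    observed fraction of harmful (label == 1) messages, then takes
    the average gap weighted by bin size.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")
    score_sum = [0.0] * n_bins
    pos_count = [0] * n_bins
    count = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        score_sum[b] += s
        pos_count[b] += y
        count[b] += 1
    total = len(scores)
    ece = 0.0
    for s_sum, p, n in zip(score_sum, pos_count, count):
        if n:
            ece += (n / total) * abs(p / n - s_sum / n)
    return ece


def recall_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Recall of the rule `score >= threshold`."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("no positive labels in sample")
    return sum(s >= threshold for s in positives) / len(positives)


def threshold_for_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> float:
    """Highest threshold t such that `score >= t` reaches target_recall.

    Note: the scenario queues scores strictly above 0.70; this sketch
    uses >= so the returned value is itself an observed score.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive labels in sample")
    k = max(1, math.ceil(target_recall * len(positives)))
    return positives[k - 1]


# Toy labeled sample (fabricated for illustration only, not production data).
sample_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
sample_labels = [1, 1, 0, 1, 1, 0, 0, 0]

print("ECE:", expected_calibration_error(sample_scores, sample_labels))
print("recall at 0.70:", recall_at(sample_scores, sample_labels, 0.70))
print("threshold for 75% recall:", threshold_for_recall(sample_scores, sample_labels, 0.75))
```

Lowering the threshold to recover recall increases queue volume, so any threshold chosen this way would need to be checked against moderator capacity. Recalibrating scores before re-thresholding (for example with isotonic or Platt scaling) is a separate step that this sketch does not cover.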
