Debug Production Recall Collapse

Scenario

You own a binary classifier that scores potentially abusive marketplace listings before moderation on a consumer platform. The model is a fine-tuned gradient-boosted tree ensemble, trained offline and deployed in Sony's MLOps stack. Listings scoring above 0.60 are sent to manual review; the rest are auto-approved. Offline validation looked strong, but six weeks after launch the trust-and-safety team reports a rise in user complaints and moderator escalations, even though the quality of the review queue itself still seems acceptable. You are asked to explain why the model performs well on training and validation data but underperforms in production.

Performance Data

Metric	Offline Validation	Production (last 14 days)
Accuracy	0.94	0.91
Precision	0.88	0.86
Recall	0.81	0.57
F1 Score	0.84	0.68
AUC-ROC	0.93	0.84
Log Loss	0.19	0.34
Positive prediction rate	9.8%	6.1%
True positive rate in new sellers	0.79	0.42
Abuse prevalence	11.2%	14.9%
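
Before reading causes into the table, it can help to check whether the reported numbers agree with one another. The sketch below is a minimal illustration, not part of the original scenario: it assumes every metric in a column was computed on the same labeled population, and the function name `implied_rates` and the dictionary layout are hypothetical. From precision, recall, and prevalence it derives the F1 score, positive prediction rate, and accuracy that those three values imply, so they can be compared with the reported figures.

```python
def implied_rates(precision, recall, prevalence):
    """Derive confusion-matrix shares (as fractions of all listings)
    from precision, recall, and abuse prevalence.

    Assumption: all three inputs were measured on the same population.
    """
    tp = recall * prevalence                # abusive and flagged
    fn = prevalence - tp                    # abusive but auto-approved
    fp = tp * (1 - precision) / precision   # clean but flagged
    tn = 1 - prevalence - fp                # clean and auto-approved
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "positive_prediction_rate": tp + fp,
        "accuracy": tp + tn,
        "missed_abuse_share": fn,
    }


# Values copied from the Performance Data table.
REPORTED = {
    "offline": {
        "precision": 0.88, "recall": 0.81, "prevalence": 0.112,
        "f1": 0.84, "positive_prediction_rate": 0.098, "accuracy": 0.94,
    },
    "production": {
        "precision": 0.86, "recall": 0.57, "prevalence": 0.149,
        "f1": 0.68, "positive_prediction_rate": 0.061, "accuracy": 0.91,
    },
}


def compare(reported):
    """Return {environment: {metric: (reported, implied)}} for the derived metrics."""
    result = {}
    for env, row in reported.items():
        implied = implied_rates(row["precision"], row["recall"], row["prevalence"])
        result[env] = {
            metric: (row[metric], round(implied[metric], 3))
            for metric in ("f1", "positive_prediction_rate", "accuracy")
        }
    return result


if __name__ == "__main__":
    for env, metrics in compare(REPORTED).items():
        for metric, (rep, imp) in metrics.items():
            print(f"{env:<11} {metric:<25} reported={rep:.3f} implied={imp:.3f}")
```

With the production column, the implied positive prediction rate comes out near 9.9%, against a reported 6.1%. A gap of that size suggests the production metrics may not all come from the same sample or label source (for example, labels that exist only for reviewed listings), which would be worth confirming before interpreting the recall drop.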

Question

How would you diagnose the gap between offline and production performance, and what changes would you recommend to restore production quality without overwhelming the moderation team?

