Diagnose Offline-Online Recommendation Failure

Scenario

You own a gradient-boosted classifier that predicts whether a user will click a historical record hint in a genealogy product. In offline validation on the last 8 weeks of labeled sessions, the model looked strong and replaced a rules-based ranker; in production, hints shown in the main discovery surface are now getting fewer clicks and more dismissals. The serving threshold is 0.60, and any hint above threshold is promoted into the top 3 positions while lower-scoring hints remain below the fold. Product leadership wants to know why the model looked better offline but is underperforming after launch.

Performance Data

Metric	Offline Validation	Production Week 1
AUC-ROC	0.91	0.79
Precision @ threshold 0.60	0.74	0.58
Recall @ threshold 0.60	0.68	0.41
F1 Score	0.71	0.48
Calibration error	0.03	0.14
Predicted positive rate	22%	24%
Hint click-through rate	18.6%	12.1%
Hint dismissal rate	7.9%	13.8%
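
A quick consistency check on the table can be a useful first diagnostic step. The sketch below recomputes F1 from the reported precision and recall, and backs out the positive-label prevalence implied by precision, recall, and predicted positive rate (using the identity precision × predicted positive rate = recall × prevalence). It assumes all three metrics in a column were measured on the same population with the same label definition, which the scenario does not confirm; the function and variable names are illustrative only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_prevalence(precision: float, recall: float, predicted_positive_rate: float) -> float:
    """Positive-label rate implied by the three reported metrics.

    True positives as a share of all examples can be written two ways:
        precision * predicted_positive_rate == recall * prevalence
    ASSUMPTION: all three metrics come from the same population and labels.
    """
    return precision * predicted_positive_rate / recall


# Values copied from the Performance Data table.
table = {
    "offline": {"precision": 0.74, "recall": 0.68, "predicted_positive_rate": 0.22},
    "production": {"precision": 0.58, "recall": 0.41, "predicted_positive_rate": 0.24},
}

summary = {
    name: {
        "f1": round(f1(m["precision"], m["recall"]), 2),
        "implied_prevalence": round(
            implied_prevalence(m["precision"], m["recall"], m["predicted_positive_rate"]), 2
        ),
    }
    for name, m in table.items()
}

for name, values in summary.items():
    print(name, values)
```

Under these assumptions the recomputed F1 values match the table (0.71 and 0.48), while the implied prevalence moves from roughly 0.24 offline to roughly 0.34 in production. A shift of that size could point to a change in the scored population or in how labels are collected after launch, though it may equally mean the production metrics were not all computed on the same set of hints.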

Question

How would you diagnose the gap between offline and production performance, and what changes would you recommend to improve both evaluation reliability and live model performance?
