You own a gradient-boosted classifier that predicts whether a user will click a historical record hint in a genealogy product. In offline validation on the last 8 weeks of labeled sessions, the model looked strong and replaced a rules-based ranker; in production, hints shown in the main discovery surface are now getting fewer clicks and more dismissals. The serving threshold is 0.60, and any hint above threshold is promoted into the top 3 positions while lower-scoring hints remain below the fold. Product leadership wants to know why the model looked better offline but is underperforming after launch.

| Metric | Offline Validation | Production Week 1 |
|---|---|---|
| AUC-ROC | 0.91 | 0.79 |
| Precision @ threshold 0.60 | 0.74 | 0.58 |
| Recall @ threshold 0.60 | 0.68 | 0.41 |
| F1 score @ threshold 0.60 | 0.71 | 0.48 |
| Calibration error | 0.03 | 0.14 |
| Predicted positive rate | 22% | 24% |
| Hint click-through rate | 18.6% | 12.1% |
| Hint dismissal rate | 7.9% | 13.8% |
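
Before diagnosing, it can help to check that the table is internally consistent. The following is a minimal sketch, not part of the original case: it recomputes F1 from the reported precision and recall, and derives the share of positives implied by precision, recall, and predicted positive rate. The helper names `f1` and `implied_base_rate` are illustrative, and the base-rate calculation assumes that all three metrics in a column were measured on the same population at the 0.60 threshold.

```python
# Values copied from the table: (offline validation, production week 1).
# ASSUMPTION: within each column, all metrics share one population and
# one threshold (0.60); the table does not state this explicitly.
metrics = {
    "precision": (0.74, 0.58),
    "recall": (0.68, 0.41),
    "f1_reported": (0.71, 0.48),
    "predicted_positive_rate": (0.22, 0.24),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_base_rate(precision: float, recall: float, pred_pos_rate: float) -> float:
    """Share of actual positives implied by the three metrics.

    true positives / N   = precision * pred_pos_rate
    actual positives / N = (true positives / N) / recall
    """
    return precision * pred_pos_rate / recall


for i, column in enumerate(("offline", "production")):
    p = metrics["precision"][i]
    r = metrics["recall"][i]
    ppr = metrics["predicted_positive_rate"][i]
    print(
        f"{column}: F1 recomputed={f1(p, r):.2f} "
        f"(reported {metrics['f1_reported'][i]:.2f}), "
        f"implied positive share={implied_base_rate(p, r, ppr):.3f}"
    )
```

Under the stated assumption, the recomputed F1 values match the table, while the implied share of positives differs between the two columns. That difference is one hint that the offline and production populations or labels may not be comparable, which is worth confirming against the raw logs rather than taking from this arithmetic alone.
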

How would you diagnose the gap between offline and production performance, and what changes would you recommend to improve both evaluation reliability and live model performance?