ShopNow wants to replace its current product recommendation ranker on the homepage with a new gradient-boosted model that predicts 7-day purchase probability. Leadership wants an offline validation framework that can credibly estimate business value before running an online experiment. The challenge is that the model can be evaluated only on items that appear in historical logs, and which items appear there was determined by the current production ranker, so the logged data is biased toward what that ranker chose to show.

| Metric | Production Ranker | Candidate Model | Relative Change |
|---|---|---|---|
| AUC-ROC | 0.681 | 0.742 | +9.0% |
| Log Loss | 0.412 | 0.366 | -11.2% |
| Precision@10 | 0.084 | 0.097 | +15.5% |
| Recall@10 | 0.211 | 0.246 | +16.6% |
| Lift in Top 5% | 2.1x | 2.8x | +33.3% |
| Calibration error | 0.061 | 0.118 | +93.4% (worse) |
| Avg. score on purchased items | 0.143 | 0.191 | +33.6% |
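
The table reports a calibration error but does not say how it was computed. As a rough sketch of one common choice, the snippet below computes an expected calibration error with equal-width probability bins: within each bin it compares the mean predicted probability to the observed purchase rate, then averages the gaps weighted by bin size. The function name, the binning scheme, and the toy data are assumptions for illustration, not ShopNow's actual metric.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE (assumed definition; the source does not specify one).

    y_true: iterable of 0/1 purchase labels
    y_prob: iterable of predicted purchase probabilities in [0, 1]
    """
    y_true = list(y_true)
    y_prob = list(y_prob)
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and equal length")

    label_sums = [0.0] * n_bins   # sum of observed labels per bin
    prob_sums = [0.0] * n_bins    # sum of predicted probabilities per bin
    counts = [0] * n_bins         # number of examples per bin

    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 goes into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        label_sums[b] += y
        prob_sums[b] += p
        counts[b] += 1

    n = len(y_true)
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        observed_rate = label_sums[b] / counts[b]
        mean_predicted = prob_sums[b] / counts[b]
        ece += (counts[b] / n) * abs(observed_rate - mean_predicted)
    return ece


# Toy data (hypothetical): low scores slightly under-predict, high scores too.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_scores = [0.15, 0.15, 0.15, 0.15, 0.85, 0.85]
toy_ece = expected_calibration_error(toy_labels, toy_scores)
```

Computing the same quantity per user segment or per score decile can show whether the candidate's miscalibration is concentrated in the high-score region that drives the top-of-page ranking.
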

The candidate model outperforms the production ranker on every ranking metric in the table, but its probabilities are poorly calibrated (calibration error nearly doubles, from 0.061 to 0.118), and the evaluation is based only on logged impressions from the old system. You need to design an offline validation framework that can demonstrate likely value, identify risks, and define what evidence is strong enough to justify an A/B test.
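
One possible way to account for the fact that the logs reflect only what the old system showed is an inverse-propensity-weighted estimate. The sketch below uses the self-normalized variant (often called SNIPS): each logged impression is reweighted by how much more or less likely the candidate would have been to show it than the production ranker was. It assumes the logs record, per impression, the production ranker's probability of showing the item and that the candidate's probability can be computed; the source mentions neither, so `logging_propensities`, `target_propensities`, and the `clip` threshold are hypothetical.

```python
def snips_estimate(rewards, logging_propensities, target_propensities, clip=10.0):
    """Self-normalized inverse propensity scoring estimate of mean reward.

    rewards: observed outcomes per logged impression (e.g. 1 if purchased)
    logging_propensities: P(item shown) under the production ranker (assumed logged)
    target_propensities: P(item shown) under the candidate model (assumed computable)
    clip: cap on importance weights to limit variance (hypothetical default)
    """
    if not (len(rewards) == len(logging_propensities) == len(target_propensities)):
        raise ValueError("all inputs must have the same length")

    weighted_reward = 0.0
    weight_total = 0.0
    for r, p_log, p_new in zip(rewards, logging_propensities, target_propensities):
        if p_log <= 0:
            raise ValueError("logging propensities must be strictly positive")
        w = min(p_new / p_log, clip)  # clipped importance weight
        weighted_reward += w * r
        weight_total += w

    if weight_total == 0:
        raise ValueError("candidate policy has no overlap with the logged data")
    return weighted_reward / weight_total


# Toy log (hypothetical): the candidate favors the last two impressions,
# which the production ranker rarely showed and which did not convert.
toy_rewards = [1, 0, 0, 0]
toy_p_log = [0.5, 0.5, 0.25, 0.25]
toy_p_new = [0.25, 0.25, 0.5, 0.5]

naive_rate = sum(toy_rewards) / len(toy_rewards)
reweighted_rate = snips_estimate(toy_rewards, toy_p_log, toy_p_new)
```

An estimate like this is only as trustworthy as the overlap between the two rankers: items the production system almost never showed get large weights, so a high share of clipped weights is itself a risk signal to report alongside the estimate.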