Design Offline Validation for Ranking Model

Context

ShopNow wants to replace its current homepage product recommendation ranker with a new gradient-boosted model that predicts 7-day purchase probability. Leadership wants an offline validation framework that can credibly estimate business value before running an online experiment. The challenge is that the candidate can only be evaluated on items that appear in historical logs, and which items were shown was decided by the current production ranker.

Current Performance

Metric	Production Ranker	Candidate Model	Relative Change
AUC-ROC	0.681	0.742	+9.0%
Log Loss	0.412	0.366	-11.2%
Precision@10	0.084	0.097	+15.5%
Recall@10	0.211	0.246	+16.6%
Lift in Top 5%	2.1x	2.8x	+33.3%
Calibration error	0.061	0.118	+93.4% (worse)
Avg. score on purchased items	0.143	0.191	+33.6%
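
The Relative Change column can be recomputed from the two value columns. The sketch below is only a sanity check on the table's arithmetic; because the inputs are the rounded values printed above, the last digit may differ slightly from figures derived from unrounded metrics. The function name is illustrative.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percent change of the candidate value relative to the baseline value."""
    return (candidate - baseline) / baseline * 100.0


# (production ranker, candidate model) pairs copied from the table above.
metrics = {
    "AUC-ROC": (0.681, 0.742),
    "Log Loss": (0.412, 0.366),
    "Precision@10": (0.084, 0.097),
    "Recall@10": (0.211, 0.246),
    "Lift in Top 5%": (2.1, 2.8),
    "Calibration error": (0.061, 0.118),
    "Avg. score on purchased items": (0.143, 0.191),
}

for name, (prod, cand) in metrics.items():
    print(f"{name}: {relative_change(prod, cand):+.1f}%")
```

Note that for Log Loss and Calibration error a lower value is better, so the sign of the change has to be read against the metric's direction.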

The Problem

The candidate model looks better on ranking metrics, but its probabilities are poorly calibrated and evaluation is based only on logged impressions from the old system. You need to design an offline validation framework that can demonstrate likely value, identify risks, and define what evidence is strong enough to justify an A/B test.
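
The source does not say how "Calibration error" was measured. One common choice is expected calibration error (ECE) over equal-width probability bins; the sketch below assumes that definition, and the function name and the default of 10 bins are illustrative rather than taken from ShopNow's setup.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |mean predicted prob - observed positive rate| per bin.

    probs:  predicted purchase probabilities in [0, 1]
    labels: 0/1 outcomes (1 = purchased within the window)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_pred - pos_rate)
    return ece


# Toy example: scores are directionally right but over- and under-confident.
print(expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]))
```

A model can rank well and still score poorly here, which is the pattern the table shows for the candidate: better AUC-ROC alongside a higher calibration error.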

Requirements

Define the offline datasets, splits, and holdout strategy you would use.
Explain which metrics you would trust most for proving value and why.
Address selection bias from historical exposure under the production ranker.
Propose how to validate calibration, thresholding, and segment-level performance.
Recommend clear go/no-go criteria for launching an online test.
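
For the selection-bias requirement, one standard family of techniques is inverse propensity scoring (IPS). The sketch below shows a clipped, self-normalized variant. It assumes each logged impression carries the probability with which the production ranker exposed the item, for example from a randomized exploration slice; the source does not say such propensities exist, so treat this as one possible approach. All names and the clip value are hypothetical.

```python
def snips_estimate(rewards, logging_propensities, target_probs, clip=10.0):
    """Self-normalized, clipped IPS estimate of mean reward under a new policy.

    rewards:              observed outcome per logged impression (e.g. 0/1 purchase)
    logging_propensities: probability the logging policy showed the item (assumed known)
    target_probs:         probability the candidate policy would show the same item
    clip:                 upper bound on importance weights to limit variance
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for r, p_log, p_new in zip(rewards, logging_propensities, target_probs):
        if p_log <= 0:
            raise ValueError("logging propensity must be positive")
        w = min(p_new / p_log, clip)  # importance weight, clipped
        weighted_sum += w * r
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Toy example: the candidate always shows the first item and never the second.
print(snips_estimate([1, 0], [0.5, 0.5], [1.0, 0.0]))
```

Clipping trades some bias for lower variance. With only 8% of catalog items receiving meaningful exposure, many items would have near-zero logging propensity, so estimates for them would be unreliable regardless of the estimator.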

Constraints

120M logged impressions from the last 90 days
Strong seasonality around weekends and promotions
Only 8% of catalog items receive meaningful exposure
Online experiment capacity is limited to one test this quarter
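
Given a 90-day log with weekend and promotion seasonality, one plausible holdout design is a forward-in-time split whose validation and test windows each span whole weeks, so every window contains weekends. The sketch below is a minimal illustration; the two-week window sizes and the function name are assumptions, not values from the source.

```python
from datetime import date, timedelta


def weekly_time_split(start: date, n_days: int = 90,
                      val_weeks: int = 2, test_weeks: int = 2):
    """Return (train, validation, test) as half-open (start, end) date ranges.

    The newest data is reserved for test, the weeks just before it for
    validation, and everything earlier for training.
    """
    end = start + timedelta(days=n_days)  # exclusive end of the log window
    test_start = end - timedelta(weeks=test_weeks)
    val_start = test_start - timedelta(weeks=val_weeks)
    if val_start <= start:
        raise ValueError("log window too short for the requested holdouts")
    return (start, val_start), (val_start, test_start), (test_start, end)


train, val, test = weekly_time_split(date(2024, 1, 1))
print(train, val, test)
```

Promotion periods would still need to be checked by hand: if a promotion falls only in the test window, the comparison between models is fair but the absolute metrics may not reflect a typical week.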
