Evaluate Calibration in Search Ranking

Scenario

You own a gradient-boosted ranking model that scores historical record hints shown in a genealogy search experience. The model outputs a click probability used both to rank hints and to suppress low-confidence items below a 0.10 threshold. Offline ranking metrics looked strong, but product partners now report that downstream surfaces using the score as a probability are over-triggering because users click far less often than the model predicts in some score bands. You need to determine whether the model is well calibrated, not just well ranked.

Performance Data

Metric	Validation	Production
AUC-ROC	0.842	0.836
NDCG@10	0.781	0.774
Log Loss	0.412	0.487
Brier Score	0.128	0.161
Avg predicted CTR	0.312	0.309
Actual CTR	0.305	0.241
ECE	0.021	0.087
P(pred)=0.8-0.9, observed CTR	0.84	0.61
P(pred)=0.6-0.7, observed CTR	0.65	0.49
P(pred)=0.2-0.3, observed CTR	0.24	0.22
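
The figures in the table can be turned into explicit calibration gaps. The following is a minimal sketch, not the team's actual tooling: it copies the table values by hand, assumes each band's midpoint stands in for the mean predicted score in that band (the table does not report it), and adds a hypothetical `expected_calibration_error` helper showing one common equal-width-bin way an ECE figure like the one above might be computed from raw scores and click labels.

```python
# Values copied from the Performance Data table.
avg_predicted_ctr = {"validation": 0.312, "production": 0.309}
actual_ctr = {"validation": 0.305, "production": 0.241}

# Observed CTR per predicted-score band.
bands = {
    (0.8, 0.9): {"validation": 0.84, "production": 0.61},
    (0.6, 0.7): {"validation": 0.65, "production": 0.49},
    (0.2, 0.3): {"validation": 0.24, "production": 0.22},
}


def band_gap(band, observed):
    """Band midpoint minus observed CTR; positive means over-prediction.

    Assumption: the midpoint approximates the band's mean predicted score.
    """
    low, high = band
    return (low + high) / 2 - observed


def calibration_ratio(split):
    """Mean predicted CTR divided by actual CTR; 1.0 means no average bias."""
    return avg_predicted_ctr[split] / actual_ctr[split]


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |mean prediction - click rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        click_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_pred - click_rate)
    return ece


if __name__ == "__main__":
    for split in ("validation", "production"):
        print(f"{split}: predicted/actual ratio = {calibration_ratio(split):.2f}")
        for band, observed in bands.items():
            gap = band_gap(band, observed[split])
            print(f"  band {band[0]:.1f}-{band[1]:.1f}: gap = {gap:+.2f}")
```

On the table's numbers, the production gap grows with the score: roughly 0.24 in the 0.8-0.9 band and 0.16 in the 0.6-0.7 band, against about 0.03 in the 0.2-0.3 band, while the predicted-to-actual ratio rises from about 1.02 in validation to about 1.28 in production. That pattern is over-prediction concentrated in the high-score bands rather than a uniform offset.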

Question

How would you measure and diagnose calibration for this model, and what would you recommend if ranking performance remains acceptable but the predicted probabilities are systematically too high in production?
