You own a gradient-boosted ranking model that scores historical record hints shown in a genealogy search experience. The model outputs a click probability used both to rank hints and to suppress low-confidence items below a 0.10 threshold. Offline ranking metrics looked strong, but product partners now report that downstream surfaces using the score as a probability are over-triggering because users click far less often than the model predicts in some score bands. You need to determine whether the model is well calibrated, not just well ranked.

| Metric | Validation | Production |
|---|---|---|
| AUC-ROC | 0.842 | 0.836 |
| NDCG@10 | 0.781 | 0.774 |
| Log Loss | 0.412 | 0.487 |
| Brier Score | 0.128 | 0.161 |
| Avg predicted CTR | 0.312 | 0.309 |
| Actual CTR | 0.305 | 0.241 |
| ECE | 0.021 | 0.087 |
| Observed CTR, predicted 0.8-0.9 | 0.84 | 0.61 |
| Observed CTR, predicted 0.6-0.7 | 0.65 | 0.49 |
| Observed CTR, predicted 0.2-0.3 | 0.24 | 0.22 |

How would you measure and diagnose calibration for this model, and what would you recommend if ranking performance remains acceptable but the predicted probabilities are systematically too high in production?
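
As a starting point for the measurement half of the question, the ECE and per-band observed CTR rows in the table could be produced by something like the sketch below. It is a minimal illustration, not the pipeline behind the table: the function name `reliability_table`, the equal-width binning, and the choice of 10 bins are all assumptions, and the scenario does not say how its ECE was actually computed.

```python
import numpy as np


def reliability_table(y_true, y_prob, n_bins=10):
    """Per-bin calibration stats and ECE, using equal-width bins on [0, 1].

    Hypothetical helper for illustration; bin count and binning scheme
    are assumptions, not taken from the scenario.
    """
    y_true = np.asarray(y_true, dtype=float)  # 1.0 = click, 0.0 = no click
    y_prob = np.asarray(y_prob, dtype=float)  # predicted click probability

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps each score to a bin index in
    # 0..n_bins-1, so a score of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])

    rows = []
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        count = int(mask.sum())
        if count == 0:
            continue  # skip empty bins; they carry no weight in ECE
        mean_pred = float(y_prob[mask].mean())
        observed = float(y_true[mask].mean())
        # ECE is the traffic-weighted average absolute gap per bin.
        ece += (count / len(y_prob)) * abs(mean_pred - observed)
        rows.append((float(edges[b]), float(edges[b + 1]), count, mean_pred, observed))
    return rows, ece
```

Each returned row is `(bin_low, bin_high, count, mean_predicted, observed_ctr)`. Running this separately on validation and production logs, and comparing `mean_predicted` against `observed_ctr` band by band, would show whether the gap is concentrated in the high-score bins, as the table suggests.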