Evaluate Calibration for Payment Risk

Context

BlueLedger uses a gradient-boosted classifier to predict the probability that a card payment will become a confirmed fraud or chargeback within 60 days. Scores above 0.80 are auto-declined, scores from 0.50 to 0.80 go to manual review, and lower scores are approved.
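
The routing policy described above can be sketched as a small function. This is only an illustration: the function name `route` and the action labels are hypothetical, and the boundary handling (a score of exactly 0.80 goes to manual review, exactly 0.50 goes to manual review) is an assumption read off the wording "above 0.80" and "from 0.50 to 0.80".

```python
AUTO_DECLINE_THRESHOLD = 0.80   # scores above this are auto-declined (from the text)
MANUAL_REVIEW_THRESHOLD = 0.50  # scores from here up to 0.80 are reviewed (from the text)


def route(score):
    """Map a predicted fraud probability to an action (hypothetical labels)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score > AUTO_DECLINE_THRESHOLD:
        return "auto_decline"
    if score >= MANUAL_REVIEW_THRESHOLD:
        return "manual_review"
    return "approve"
```

Because the actions are keyed to fixed probability cut-offs, any drift in what a score of 0.50 or 0.80 means in realized fraud terms changes the policy even if the ranking of transactions is unchanged.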

The model still ranks transactions reasonably well, but risk leaders are concerned that its predicted probabilities are not trustworthy enough for high-stakes payment decisions. In the last quarter, predicted risk in several score bands diverged materially from realized fraud rates, creating avoidable losses and inconsistent review policies.

Current Performance

Metric	Current Model	Prior Quarter	Change
AUC-ROC	0.91	0.92	-0.01
Log Loss	0.184	0.156	+0.028
Brier Score	0.061	0.047	+0.014
Expected Calibration Error (ECE)	0.072	0.031	+0.041
Precision @ auto-decline threshold	0.88	0.90	-0.02
Recall @ auto-decline threshold	0.41	0.44	-0.03
Monthly fraud loss	$4.8M	$3.9M	+$0.9M
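
For reference, the three probability-quality metrics in the table (log loss, Brier score, ECE) are conventionally computed as in the sketch below. The source does not say how BlueLedger bins scores for ECE, so the 10 equal-width bins here are an assumption, as are the function names.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Share-weighted average of |mean predicted - observed rate| over score bins.

    Assumes equal-width bins on [0, 1]; the binning behind the reported
    ECE is not stated in the source.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(mean_pred - observed)
    return ece
```

AUC-ROC depends only on the ordering of scores, so it can stay nearly flat while all three of these metrics worsen.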

Calibration by Score Band

Predicted Score Band	Share of Txns	Avg Predicted Risk	Observed Fraud Rate
0.00-0.10	72%	0.03	0.02
0.10-0.30	18%	0.19	0.11
0.30-0.50	6%	0.39	0.21
0.50-0.80	3%	0.64	0.58
0.80-1.00	1%	0.91	0.97
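
The band table can be turned into per-band calibration gaps with a few lines of arithmetic. The sketch below uses only the figures from the table; the function names are illustrative. The share-weighted gap it produces is computed over these five bands only, so it should not be expected to reproduce the reported ECE of 0.072, whose binning is not stated.

```python
# Rows copied from the "Calibration by Score Band" table:
# (band, share of transactions, average predicted risk, observed fraud rate)
BANDS = [
    ("0.00-0.10", 0.72, 0.03, 0.02),
    ("0.10-0.30", 0.18, 0.19, 0.11),
    ("0.30-0.50", 0.06, 0.39, 0.21),
    ("0.50-0.80", 0.03, 0.64, 0.58),
    ("0.80-1.00", 0.01, 0.91, 0.97),
]


def band_gaps(rows):
    """Predicted minus observed per band; positive means the model overpredicts."""
    return [(band, round(pred - obs, 4)) for band, _, pred, obs in rows]


def weighted_abs_gap(rows):
    """Share-weighted mean absolute gap across the bands (a band-level ECE)."""
    return sum(share * abs(pred - obs) for _, share, pred, obs in rows)


def implied_base_rate(rows):
    """Overall fraud rate implied by band shares and observed rates."""
    return sum(share * obs for _, share, _, obs in rows)


if __name__ == "__main__":
    for band, gap in band_gaps(BANDS):
        print(f"{band}: {gap:+.2f}")
    print(f"weighted |gap|: {weighted_abs_gap(BANDS):.4f}")
    print(f"implied base rate: {implied_base_rate(BANDS):.4f}")
```

On these figures the gap is positive in the four bands below 0.80 and negative in the 0.80-1.00 band, with the largest absolute gap (0.18) in the 0.30-0.50 band.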

Task

Assess whether the model is well calibrated for this decision setting.
Explain which metrics and plots you would use, and why ranking metrics alone are insufficient.
Diagnose the business risk created by the current miscalibration across score bands.
Recommend how to recalibrate or redesign decision thresholds.
Describe how you would validate calibration by segment before deployment.

Constraints

False declines create customer churn and merchant complaints.
Missed fraud has an average loss of $420 per transaction.
Manual review capacity is fixed at 18,000 transactions per day.
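
These constraints can be tied to thresholds with simple expected-cost arithmetic. The sketch below assumes the scores are calibrated and that a false decline has a single dollar cost; the source gives no such figure, so `false_decline_cost` is a hypothetical input, and the function names are illustrative. The $420 loss, the 18,000 daily review capacity, and the 3% review-band share come from the sections above.

```python
MISSED_FRAUD_LOSS = 420.0          # USD per missed fraud (Constraints)
REVIEW_CAPACITY_PER_DAY = 18_000   # manual reviews per day (Constraints)
REVIEW_BAND_SHARE = 0.03           # share of transactions scored 0.50-0.80 (band table)


def break_even_threshold(false_decline_cost, missed_fraud_loss=MISSED_FRAUD_LOSS):
    """Fraud probability above which declining has lower expected cost than approving.

    Declining is cheaper when p * missed_fraud_loss > (1 - p) * false_decline_cost,
    which rearranges to p > false_decline_cost / (false_decline_cost + missed_fraud_loss).
    Only meaningful if p is a calibrated probability.
    """
    if false_decline_cost < 0 or missed_fraud_loss <= 0:
        raise ValueError("costs must be non-negative and the fraud loss positive")
    return false_decline_cost / (false_decline_cost + missed_fraud_loss)


def max_daily_volume(capacity=REVIEW_CAPACITY_PER_DAY, review_share=REVIEW_BAND_SHARE):
    """Largest daily transaction count for which the review band fits within capacity."""
    return capacity / review_share
```

For example, with a hypothetical false-decline cost of $60, `break_even_threshold(60.0)` gives 0.125, and `max_daily_volume()` gives about 600,000 transactions per day: at the current 3% review share, volumes above that would exceed the fixed review capacity.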
