NorthStar Bank deployed a gradient boosting model to predict whether a small-business loan applicant will default within 12 months. The model replaced a logistic regression scorecard and is now used to auto-approve, reject, or route applications for manual underwriting.
After two quarters in production, leadership sees mixed results: default losses are slightly lower, but approval rates fell and underwriters report too many borderline cases being escalated. You need to assess whether the model is actually performing well and how its accuracy should be verified beyond a single headline metric.
| Metric | Validation Set | Production (last 60 days) | Baseline Scorecard |
|---|---|---|---|
| Accuracy | 0.842 | 0.801 | 0.776 |
| Precision | 0.691 | 0.648 | 0.571 |
| Recall | 0.583 | 0.472 | 0.514 |
| F1 Score | 0.632 | 0.545 | 0.541 |
| AUC-ROC | 0.861 | 0.823 | 0.781 |
| Default Rate | 18.2% | 21.4% | 21.4% |
| Manual Review Rate | 18.0% | 27.4% | 16.1% |
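A low-cost first verification step is to check that the reported metrics are internally consistent: F1 is the harmonic mean of precision and recall, so it can be recomputed directly from the table. A minimal sketch using the values above (a small tolerance absorbs the three-decimal rounding in the reported figures):

```python
# Check that each reported F1 equals the harmonic mean of the reported
# precision and recall, up to rounding of the published table values.
rows = {
    "validation": {"precision": 0.691, "recall": 0.583, "f1": 0.632},
    "production": {"precision": 0.648, "recall": 0.472, "f1": 0.545},
    "baseline":   {"precision": 0.571, "recall": 0.514, "f1": 0.541},
}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, m in rows.items():
    implied = f1_from(m["precision"], m["recall"])
    # Reported values are rounded to 3 decimals, so allow ~0.005 slack.
    assert abs(implied - m["f1"]) < 0.005, f"{name}: implied F1 {implied:.3f}"
    print(f"{name}: implied F1 = {implied:.3f}, reported = {m['f1']:.3f}")
```

All three rows check out within rounding here; a mismatch would suggest the metrics were computed on different samples or at different decision thresholds, which would undermine any cross-column comparison.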
On AUC and accuracy the model beats the baseline, but production recall (0.472) has fallen below even the old scorecard's (0.514), meaning the model now misses a larger share of true defaulters than the system it replaced, while the manual review rate has climbed from 16.1% to 27.4%. The bank wants to know whether the model is truly better, which verification steps are missing, and what should be improved before expanding auto-decisioning.
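The table itself motivates one missing verification step: the production default rate (0.214) differs from the validation set's (0.182), which points to population drift that could explain the recall drop. A standard way to quantify drift is the Population Stability Index (PSI) computed over binned model scores or input features. The sketch below uses invented score-decile proportions purely for illustration (not NorthStar data); a common rule of thumb reads PSI < 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant shift:

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions.

    Each list holds the proportion of observations per bin; both must
    sum to 1 and contain no zero bins.
    """
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Illustrative score-decile proportions (invented for demonstration).
validation_bins = [0.10] * 10                      # uniform by construction
production_bins = [0.04, 0.05, 0.07, 0.08, 0.09,
                   0.10, 0.11, 0.13, 0.15, 0.18]   # mass shifted toward high scores

print(f"PSI = {psi(validation_bins, production_bins):.3f}")
```

In practice the bins would come from the validation score deciles, and the same check would be repeated per input feature to locate the source of the shift; a materially nonzero PSI means validation-set metrics no longer describe the population the model actually scores.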