You own a gradient-boosted credit risk model used in a consumer lending flow to predict 90-day default risk. Applications with scores above the 0.60 threshold are auto-declined, scores from 0.40 to 0.60 are sent to manual review, and scores below 0.40 are auto-approved. After a policy review, non-technical stakeholders asked why some applicants with similar income receive different decisions and whether the model's predictions are understandable and trustworthy. You need to explain individual predictions and overall model behavior using evidence from Azure Machine Learning evaluation outputs and recent production results.
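The decision policy described above can be sketched as a small routing function. This is a minimal illustration, not the production implementation: the function name, parameter names, and return labels are hypothetical, and the handling of scores exactly at 0.40 and 0.60 (both sent to manual review) is an assumption read from the wording of the scenario.

```python
def route_application(score: float,
                      decline_above: float = 0.60,
                      review_from: float = 0.40) -> str:
    """Map a predicted 90-day default probability to a decision bucket.

    Assumption: boundary scores (exactly 0.40 or 0.60) go to manual review.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score > decline_above:
        return "auto_decline"
    if score >= review_from:
        return "manual_review"
    return "auto_approve"


# Two applicants with similar income can still land in different buckets
# if other features move their scores across a threshold.
for s in (0.38, 0.41, 0.62):
    print(s, route_application(s))
```

Because the buckets are defined by hard cutoffs, small score differences near 0.40 or 0.60 translate into different outcomes, which is one reason similar-looking applicants may be treated differently.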

| Metric | Validation | Production |
|---|---|---|
| AUC-ROC | 0.84 | 0.82 |
| Precision (decline class) | 0.71 | 0.69 |
| Recall (decline class) | 0.63 | 0.58 |
| F1 Score (decline class) | 0.67 | 0.63 |
| Brier Score | 0.148 | 0.176 |
| Expected Calibration Error | 0.03 | 0.09 |
| Manual review rate | 18% | 24% |
| 90-day default rate among auto-approved | 2.9% | 4.1% |
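
As a quick sanity check on the table, the sketch below recomputes F1 from the reported decline-class precision and recall and lists the validation-to-production change for each metric. The values are copied from the table; the dictionary layout and helper names are illustrative assumptions, not part of any Azure Machine Learning output format.

```python
# (validation, production) pairs copied from the table above.
metrics = {
    "AUC-ROC": (0.84, 0.82),
    "Precision (decline class)": (0.71, 0.69),
    "Recall (decline class)": (0.63, 0.58),
    "Brier Score": (0.148, 0.176),
    "Expected Calibration Error": (0.03, 0.09),
    "Manual review rate": (0.18, 0.24),
    "Default rate among auto-approved": (0.029, 0.041),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for each environment from the reported precision and recall.
f1_validation = f1(0.71, 0.63)
f1_production = f1(0.69, 0.58)
print(f"F1 validation: {f1_validation:.2f}, production: {f1_production:.2f}")

# Absolute change from validation to production for every metric.
deltas = {name: round(prod - val, 3) for name, (val, prod) in metrics.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

The recomputed F1 values match the reported 0.67 and 0.63, which suggests the F1 row refers to the decline class. The ranking metric (AUC-ROC) moves only slightly, while the calibration metrics (Brier score, Expected Calibration Error) move more in relative terms.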

How would you explain this model's predictions to business and compliance stakeholders in a way that is both technically accurate and actionable, and what do these results suggest about the reliability of the explanations you would present?