You own a binary classifier that predicts whether a customer support message should be escalated for urgent review on a digital banking platform. The current logistic regression model uses a 0.50 threshold, and leadership is focused on its 96.8% accuracy in offline validation. However, only a small share of messages are truly urgent, and operations reports that several high-risk cases were not escalated while some teams argue the model still looks strong because overall accuracy remains high. You are asked to explain when accuracy is misleading and when precision, recall, F1-score, or ROC-AUC should be the primary metric.
| Metric | Validation Set |
|---|---|
| Accuracy | 96.8% |
| Precision | 61.5% |
| Recall | 40.0% |
| F1 Score | 48.5% |
| ROC-AUC | 0.89 |
| Positive class rate | 3.0% |
| Threshold | 0.50 |
How would you interpret these results, and in what situations would you prioritize precision, recall, F1-score, or ROC-AUC over accuracy for this model?