HealthShield uses a binary classifier to detect fraudulent insurance claims before payout. The positive class is highly imbalanced: only 1% of 500,000 historical claims are fraudulent, yet the current model is judged internally on overall accuracy.
| Metric | Current Model | Naive Always-Negative Baseline |
|---|---|---|
| Fraud prevalence | 1.0% | 1.0% |
| Accuracy | 98.3% | 99.0% |
| Precision | 28.0% | 0.0% |
| Recall | 42.0% | 0.0% |
| F1 Score | 33.6% | 0.0% |
| AUC-ROC | 0.91 | 0.50 |
| PR-AUC | 0.31 | 0.01 |
| Claims flagged for review | 7,500 | 0 |

Leadership sees 98.3% accuracy and assumes the model is production-ready, but the fraud team counters that accuracy is misleading at 1% prevalence: a model that never flags anything scores 99.0%. Your task is to evaluate whether the model is actually useful, explain the tradeoffs the imbalance creates, and recommend improvements to both the evaluation protocol and the model itself.
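One way the recommendation could look in practice, sketched with scikit-learn on synthetic data (this is an illustrative stand-in, not HealthShield's pipeline; the dataset, model, and parameters are all assumptions): score the model with PR-AUC rather than accuracy, counteract the imbalance with class weighting, and tune the decision threshold on a validation set instead of defaulting to 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims data: ~1% positive class.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare fraud class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_val)[:, 1]
# PR-AUC: a no-skill model scores ~= prevalence (0.01), not 0.5 like ROC-AUC.
print("PR-AUC:", average_precision_score(y_val, scores))

# Choose the operating threshold that maximizes F1 on validation data.
prec, rec, thresh = precision_recall_curve(y_val, scores)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = np.argmax(f1[:-1])  # last (prec, rec) point has no threshold
print("threshold:", thresh[best],
      "precision:", prec[best], "recall:", rec[best])
```

In production the threshold would be chosen against business costs (review capacity vs. fraud losses) rather than F1, but the mechanics are the same: sweep the precision-recall curve and pick the operating point deliberately.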