ShopSafe is building a binary classifier to detect fraudulent orders before fulfillment. Fraud is rare, so leadership is concerned that the current evaluation dashboard may overstate model quality by focusing on accuracy.
The team evaluated a logistic regression model on 100,000 recent orders. Only 1,000 orders were actually fraudulent.
| Metric | Value |
|---|---|
| Accuracy | 0.991 |
| Precision | 0.750 |
| Recall | 0.180 |
| F1 Score | 0.290 |
| AUC-ROC | 0.840 |
| Fraud prevalence | 0.010 |
Confusion matrix counts:
| | Predicted Fraud | Predicted Legitimate |
|---|---|---|
| Actual Fraud | 180 | 820 |
| Actual Legitimate | 60 | 98,940 |
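The dashboard numbers can be verified directly from the confusion matrix. A minimal sketch in plain Python, using only the counts above:

```python
# Recompute the summary metrics from the confusion matrix counts,
# so the dashboard values can be sanity-checked.
tp, fn = 180, 820        # actual fraud: caught vs. missed
fp, tn = 60, 98_940      # actual legitimate: falsely flagged vs. cleared

total = tp + fn + fp + tn                # 100,000 orders
accuracy = (tp + tn) / total             # correct predictions / all orders
precision = tp / (tp + fp)               # of flagged orders, how many were fraud
recall = tp / (tp + fn)                  # of fraudulent orders, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Running this reproduces precision 0.750, recall 0.180, and F1 0.290 from the table; accuracy comes out to 0.9912, dominated by the 98,940 true negatives.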
The product manager sees 99.1% accuracy and believes the model is ready for launch. The risk team counters that the model is still weak: with recall of 0.18, it misses 82% of fraudulent orders. You need to explain what the F1 score means, why it matters at 1% fraud prevalence, and whether it is a better summary metric than accuracy.
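One way to make the risk team's point concrete is to score a degenerate baseline. The sketch below (an illustrative comparison, not part of the team's evaluation) evaluates a classifier that never flags anything: at 1% prevalence it nearly matches the real model on accuracy while catching zero fraud, which is exactly the failure mode F1 exposes.

```python
# A trivial classifier that predicts "legitimate" for every order.
# On the same 100,000 orders: no true or false positives, all fraud missed.
tp, fn, fp, tn = 0, 1_000, 0, 99_000

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 0.99 -- looks strong
recall = tp / (tp + fn)                      # 0.0  -- catches no fraud
# Precision is undefined (no positive predictions); by the usual
# convention F1 is reported as 0 when there are no true positives.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"baseline: accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy barely distinguishes the real model from this do-nothing baseline, while F1 (0.29 vs. 0.0) does, because F1 ignores true negatives entirely and so cannot be inflated by the 99% majority class.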