MailShield uses a binary classification model to detect spam emails for small business inboxes. The team reports strong overall accuracy, but users still complain that too many spam messages reach the inbox while some legitimate emails are incorrectly filtered.

| Metric | Value |
|---|---|
| Accuracy | 0.957 |
| Precision | 0.780 |
| Recall | 0.650 |
| F1 Score | 0.709 |
| AUC-ROC | 0.901 |
| Spam prevalence in evaluation set | 8.0% |
| Evaluation set size | 50,000 emails |

| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actual Spam | 2,600 | 1,400 |
| Actual Not Spam | 733 | 45,267 |
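
As a sanity check, the headline metrics can be recomputed from the confusion matrix above. The following is a minimal sketch; the function name `classification_metrics` and the variable names are illustrative choices, not part of MailShield's codebase.

```python
# Confusion-matrix counts copied from the table above.
tp, fn = 2_600, 1_400   # actual spam: flagged / missed
fp, tn = 733, 45_267    # actual not spam: wrongly flagged / delivered


def classification_metrics(tp, fn, fp, tn):
    """Derive standard binary-classification metrics from raw counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)   # share of flagged emails that are spam
    recall = tp / (tp + fn)      # share of spam that gets flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "prevalence": (tp + fn) / total,
    }


metrics = classification_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name:>10}: {value:.3f}")
```

Read in product terms, the counts mean that 1,400 of the 4,000 spam emails (35%) reach the inbox, while 733 legitimate emails are filtered. Both user complaints are visible in the matrix even though accuracy looks high.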

The product manager wants to know whether F1 is the right headline metric for this model and what the current F1 score actually says about model quality. Because spam makes up only 8% of the evaluation set, the team suspects that accuracy overstates performance.
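
One way to frame both questions is to compare accuracy against a trivial baseline and to see how the F-score shifts when precision and recall are weighted differently. The sketch below assumes the reported precision, recall, and prevalence. The `beta` values are illustrative only and do not reflect a stated MailShield preference.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision.
    beta = 1 gives the ordinary F1 score.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.780, 0.650   # values from the metrics table
prevalence = 0.08                  # share of spam in the evaluation set

# A filter that never flags anything is right on every non-spam email,
# so its accuracy equals the non-spam share while its recall is zero.
baseline_accuracy = 1 - prevalence
print(f"always-not-spam accuracy: {baseline_accuracy:.3f}")

# Illustrative weightings: favor precision, balance both, favor recall.
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g}: {f_beta(precision, recall, beta):.3f}")
```

The baseline shows why accuracy is a weak headline number here: flagging nothing at all already scores 0.92. F1 weights missed spam and wrongly filtered mail equally; if one of those errors costs users more than the other, an F-beta score with a deliberately chosen `beta` may be a better fit than F1.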