At ArenaPlay, a binary classification model predicts whether a player will exhibit a rare harmful behavior within the next 30 days. The event rate is only 0.1%: in a validation set of 1,000,000 players, only 1,000 are actual positives.
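The team initially celebrated the model's high accuracy, but operations reports show that it misses half of the true cases (recall is 50.0%) while also generating too many alerts for the review team: 6,000 players are flagged, and about 5,500 of them are false positives.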
| Metric | Current Model | Baseline: Predict All Negative |
|---|---|---|
| Accuracy | 99.40% | 99.90% |
| Precision | 8.3% | 0.0% |
| Recall | 50.0% | 0.0% |
| F1 Score | 14.3% | 0.0% |
| AUC-ROC | 0.91 | 0.50 |
| PR AUC | 0.19 | 0.001 |
| Flagged players | 6,000 | 0 |
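
The counts behind the table can be reconstructed from the figures already given (validation-set size, number of positives, flagged players, and recall). The following is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from ArenaPlay's codebase.

```python
# Reconstruct the confusion matrix implied by the table (validation set).
total = 1_000_000      # players in the validation set
positives = 1_000      # actual positives (0.1% event rate)
flagged = 6_000        # players flagged by the current model
recall = 0.50          # reported recall of the current model

tp = round(recall * positives)   # true positives: 500
fp = flagged - tp                # false positives: 5,500
fn = positives - tp              # missed true cases: 500
tn = total - tp - fp - fn        # true negatives: 993,500

accuracy = (tp + tn) / total
precision = tp / flagged
f1 = 2 * precision * recall / (precision + recall)

# Trivial baseline: flag nobody, so every actual negative counts as correct.
baseline_accuracy = (total - positives) / total

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2%} precision={precision:.1%} f1={f1:.1%}")
print(f"baseline accuracy={baseline_accuracy:.2%}")
```

The reconstructed values match the table: 99.40% accuracy, 8.3% precision, and 14.3% F1 for the current model, against 99.90% accuracy for the baseline that never flags anyone.
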
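Leadership wants to know which metrics should be used to evaluate this model and whether the current model is good enough to deploy. The main concern is that accuracy looks excellent despite poor practical usefulness in a highly imbalanced setting: at 99.40%, the current model's accuracy is actually lower than the 99.90% obtained by predicting that every player is negative.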
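
One possible way to frame the comparison for leadership is to express the imbalance-aware metrics relative to the 0.1% base rate and to the review workload. The sketch below uses only numbers from the table; the derived quantities (`precision_lift`, `pr_auc_lift`, `alerts_per_true_case`) are illustrative names introduced here, not metrics the team is known to track.

```python
# Figures taken from the scenario and the metrics table.
prevalence = 1_000 / 1_000_000   # base rate of the harmful behavior (0.1%)
true_cases_caught = 500          # 50% recall on 1,000 actual positives
flagged = 6_000                  # players flagged by the current model
pr_auc = 0.19                    # reported PR AUC of the current model

precision = true_cases_caught / flagged

# Precision relative to flagging players at random (where precision ~ prevalence).
precision_lift = precision / prevalence

# PR AUC relative to the all-negative baseline's 0.001, which equals the prevalence.
pr_auc_lift = pr_auc / prevalence

# Review workload: alerts the team must inspect per confirmed true case.
alerts_per_true_case = flagged / true_cases_caught

print(f"precision lift over base rate: {precision_lift:.0f}x")
print(f"PR AUC lift over baseline:     {pr_auc_lift:.0f}x")
print(f"alerts per true case:          {alerts_per_true_case:.0f}")
```

Under these assumptions the model ranks far better than chance (roughly 83x lift in precision and 190x in PR AUC), yet reviewers still inspect 12 alerts for each true case found, and half of the true cases are never flagged. Whether that trade-off is acceptable depends on review capacity and on the cost of a missed case, neither of which is given in the scenario.
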