You own a binary classifier that prioritizes suspicious account sign-ins for manual review in Microsoft Defender. The current logistic regression model and a new LightGBM challenger are both evaluated offline before deployment, and sign-ins scoring above a 0.40 threshold are sent to analysts. Security leadership notices that the challenger has a slightly higher AUC-ROC, but the operations team prefers the current model because it has higher precision and F1 at the chosen threshold. You need to explain what each metric actually measures and which one should guide model selection for this use case.
| Metric | Current Model | Challenger Model |
|---|---|---|
| AUC-ROC | 0.91 | 0.94 |
| Precision @ 0.40 | 0.74 | 0.61 |
| Recall @ 0.40 | 0.68 | 0.79 |
| F1 Score @ 0.40 | 0.71 | 0.69 |
| False Positive Rate @ 0.40 | 0.032 | 0.071 |
| Daily alerts sent to analysts | 4,300 | 7,100 |
| Analyst review capacity/day | 5,000 | 5,000 |
| Positive class prevalence | 2.8% | 2.8% |
How would you explain the difference between AUC-ROC and F1-score using these results, and when would you prefer one over the other for selecting or tuning this model?
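
As a starting point, the threshold-level rows of the table can be sanity-checked with a few lines of Python. The sketch below is only illustrative: it recomputes F1 as the harmonic mean of the reported precision and recall at 0.40 and compares daily alert volume with analyst capacity. The names `models`, `REVIEW_CAPACITY`, `f1_from_pr`, and `summarize` are hypothetical and not part of any existing pipeline. AUC-ROC is not recomputed, because it summarizes ranking quality across all thresholds and would need the per-sign-in scores, which the table does not provide.

```python
# Values copied from the table; all names here are illustrative assumptions.
models = {
    "current": {"precision": 0.74, "recall": 0.68, "daily_alerts": 4300},
    "challenger": {"precision": 0.61, "recall": 0.79, "daily_alerts": 7100},
}
REVIEW_CAPACITY = 5000  # analyst reviews per day, from the table


def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall at one threshold."""
    return 2 * precision * recall / (precision + recall)


def summarize(models: dict, capacity: int) -> dict:
    """Recompute F1 and the number of daily alerts beyond review capacity."""
    summary = {}
    for name, m in models.items():
        summary[name] = {
            "f1": round(f1_from_pr(m["precision"], m["recall"]), 2),
            "alerts_over_capacity": max(0, m["daily_alerts"] - capacity),
        }
    return summary


summary = summarize(models, REVIEW_CAPACITY)
for name, row in summary.items():
    print(name, row)
# current {'f1': 0.71, 'alerts_over_capacity': 0}
# challenger {'f1': 0.69, 'alerts_over_capacity': 2100}
```

The recomputed F1 values match the table, and the capacity check makes the operational point concrete: at the 0.40 threshold the challenger would send 2,100 more alerts per day than analysts can review, a constraint that a threshold-independent metric such as AUC-ROC does not reflect.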