GitLab has deployed a binary classifier to predict whether a merge request will cause a production incident within 7 days of deployment. The model is used in CI/CD to trigger extra review steps for high-risk changes. Incidents are rare, so the dataset is highly imbalanced.
The current model was trained on 1.2M historical merge requests, with a positive rate of 1.8%. On the latest validation set, the team reports strong overall accuracy, but SRE and engineering managers say too many incident-causing merge requests are still passing through without additional review.
| Metric | Value |
|---|---|
| Positive rate | 1.8% |
| Accuracy | 98.1% |
| Precision | 0.29 |
| Recall | 0.41 |
| F1 Score | 0.34 |
| AUC-ROC | 0.87 |
| PR-AUC | 0.31 |
| Log Loss | 0.096 |
| Threshold | 0.50 |
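Accuracy is nearly meaningless at this positive rate: a degenerate classifier that never flags any merge request would score 98.2% accuracy (1 − 0.018) while catching zero incidents. A minimal sketch with synthetic labels (the 1,000-MR sample and the all-negative "classifier" are illustrative, not GitLab's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels matching the reported 1.8% positive rate:
# 1,000 merge requests, 18 of which caused incidents.
y_true = np.zeros(1000, dtype=int)
y_true[:18] = 1

# A degenerate baseline that never flags anything.
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)                      # 0.982
rec = recall_score(y_true, y_pred, zero_division=0)       # 0.0
print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

The baseline's 98.2% accuracy actually exceeds the model's 98.1%, which is why precision, recall, and PR-AUC, not accuracy or ROC-AUC alone, are the metrics to watch here.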
The model looks strong if judged by accuracy alone, but the business outcome is poor: with recall at 0.41, the majority of incident-causing merge requests are never escalated for extra review. You need to explain how to evaluate this model correctly on an imbalanced dataset, which metrics matter most, and what changes you would recommend.
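One concrete change worth recommending is tuning the decision threshold against the precision-recall curve instead of keeping the default 0.50, since the cost of a missed incident far exceeds the cost of an extra review. A sketch using `sklearn.metrics.precision_recall_curve` on synthetic validation scores (the Beta-distributed scores, 0.80 recall target, and 20,000-sample size are illustrative assumptions, not the team's data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical validation set: ~1.8% positives, with incident MRs
# scoring higher on average but overlapping the negatives.
y_true = rng.random(n) < 0.018
scores = np.where(y_true, rng.beta(4, 6, n), rng.beta(2, 8, n))

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the highest threshold that still achieves the recall target,
# deliberately trading precision for recall.
target_recall = 0.80
idx = np.max(np.nonzero(recall[:-1] >= target_recall))
chosen = thresholds[idx]
print(f"threshold={chosen:.3f}, "
      f"precision={precision[idx]:.3f}, recall={recall[idx]:.3f}")
```

On data shaped like this, the chosen threshold lands well below 0.50; the same procedure on the real validation scores would quantify exactly how much precision must be given up to reach an acceptable escalation rate.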