GitLab has deployed a binary classifier to predict whether a merge request will cause a production incident within 7 days of deployment. The model is used in CI/CD to trigger extra review steps for high-risk changes. Incidents are rare, so the dataset is highly imbalanced.
The current model was trained on 1.2M historical merge requests, with a positive rate of 1.8%. On the latest validation set, the team reports strong overall accuracy, but SRE and engineering managers say too many incident-causing merge requests are still passing through without additional review.
| Metric | Value |
|---|---|
| Positive rate | 1.8% |
| Accuracy | 98.1% |
| Precision | 0.29 |
| Recall | 0.41 |
| F1 Score | 0.34 |
| AUC-ROC | 0.87 |
| PR-AUC | 0.31 |
| Log Loss | 0.096 |
| Threshold | 0.50 |
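Accuracy is nearly meaningless at this positive rate: a degenerate classifier that never flags any merge request would score 98.2% accuracy (1 − 0.018) while catching zero incidents. A minimal sketch with synthetic labels (the 1,000-MR sample and the all-negative "classifier" are illustrative, not GitLab's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels matching the reported 1.8% positive rate:
# 1,000 merge requests, 18 of which caused incidents.
y_true = np.zeros(1000, dtype=int)
y_true[:18] = 1

# A degenerate baseline that never flags anything.
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)                      # 0.982
rec = recall_score(y_true, y_pred, zero_division=0)       # 0.0
print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

The baseline's 98.2% accuracy actually exceeds the model's 98.1%, which is why precision, recall, and PR-AUC, not accuracy or ROC-AUC alone, are the metrics to watch here.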
The model looks strong if judged by accuracy alone, but the business outcome is poor: with recall at 0.41, the majority of incident-causing merge requests are never escalated for extra review. You need to explain how to evaluate this model correctly on an imbalanced dataset, which metrics matter most, and what changes you would recommend.
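One concrete change worth recommending is tuning the decision threshold against the precision-recall curve instead of keeping the default 0.50, since the cost of a missed incident far exceeds the cost of an extra review. A sketch using `sklearn.metrics.precision_recall_curve` on synthetic validation scores (the Beta-distributed scores, 0.80 recall target, and 20,000-sample size are illustrative assumptions, not the team's data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical validation set: ~1.8% positives, with incident MRs
# scoring higher on average but overlapping the negatives.
y_true = rng.random(n) < 0.018
scores = np.where(y_true, rng.beta(4, 6, n), rng.beta(2, 8, n))

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the highest threshold that still achieves the recall target,
# deliberately trading precision for recall.
target_recall = 0.80
idx = np.max(np.nonzero(recall[:-1] >= target_recall))
chosen = thresholds[idx]
print(f"threshold={chosen:.3f}, "
      f"precision={precision[idx]:.3f}, recall={recall[idx]:.3f}")
```

On data shaped like this, the chosen threshold lands well below 0.50; the same procedure on the real validation scores would quantify exactly how much precision must be given up to reach an acceptable escalation rate.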