GitLab deployed a gradient-boosted classifier in GitLab Duo to predict whether a developer will accept an AI-generated code suggestion. Offline validation looked strong, so the model replaced a heuristic ranker in production. After launch, online acceptance and user satisfaction dropped, even though offline metrics on the holdout set remained high.
| Metric | Offline Validation | Production (last 14 days) | Change |
|---|---|---|---|
| AUC-ROC | 0.91 | 0.74 | -0.17 |
| Precision @ serving threshold 0.60 | 0.84 | 0.58 | -0.26 |
| Recall @ serving threshold 0.60 | 0.76 | 0.41 | -0.35 |
| F1 Score | 0.80 | 0.48 | -0.32 |
| Log Loss | 0.29 | 0.67 | +0.38 |
| Predicted positive rate | 31% | 18% | -13 pts |
| Suggestion acceptance rate | 27% | 16% | -11 pts |
| User-reported “helpful” rate | 72% | 54% | -18 pts |
You need to explain why the model looks strong offline yet underperforms in production, and determine whether the gap stems from data drift, label mismatch, a stale serving threshold, miscalibration, a serving bug, or evaluation leakage.
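A minimal diagnostic sketch of how a few of these hypotheses could be separated, assuming two hypothetical DataFrames: `offline` (the validation holdout) and `online` (recent production traffic joined with acceptance labels), each with a model `score` column and a binary `accepted` label. The names and thresholds here are illustrative assumptions, not GitLab's actual pipeline.

```python
# Hypothetical diagnostic checks for the offline/online gap; column and
# frame names (`offline`, `online`, `score`, `accepted`) are assumptions.
import numpy as np
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import precision_recall_curve


def population_stability_index(expected, actual, bins=10):
    """PSI between two score distributions; > 0.25 is commonly read as heavy drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


def diagnose(offline: pd.DataFrame, online: pd.DataFrame, threshold: float = 0.60):
    # 1. Score drift: has the served score distribution shifted vs. the holdout?
    psi = population_stability_index(offline["score"], online["score"])

    # 2. Calibration: do predicted probabilities still match observed acceptance online?
    frac_pos, mean_pred = calibration_curve(
        online["accepted"], online["score"], n_bins=10, strategy="quantile"
    )
    calib_gap = float(np.mean(np.abs(frac_pos - mean_pred)))

    # 3. Thresholding: is 0.60 still a sensible operating point on online data?
    prec, rec, thr = precision_recall_curve(online["accepted"], online["score"])
    f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
    best_f1_threshold = float(thr[np.argmax(f1)])

    # 4. Base-rate / label mismatch: compare positive rates across environments.
    base_rates = (float(offline["accepted"].mean()), float(online["accepted"].mean()))

    return {
        "score_psi": psi,
        "mean_calibration_gap": calib_gap,
        "offline_vs_online_base_rate": base_rates,
        "served_threshold": threshold,
        "online_best_f1_threshold": best_f1_threshold,
    }
```

Read as a rough guide: a high score PSI points toward input or population drift; a large calibration gap with a stable score distribution suggests the 0.60 threshold no longer sits at the intended precision/recall trade-off; and if both look clean while online labels still diverge, the remaining suspects are the label join, feature parity between the training and serving paths, and leakage in the offline evaluation.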