Diagnose Offline-Online Performance Gap | Dataford Interview Questions

Context

GitLab deployed a gradient-boosted classifier in GitLab Duo to predict whether a developer will accept an AI-generated code suggestion. Offline validation looked strong, so the model replaced a heuristic ranker in production. After launch, online acceptance and user satisfaction both dropped, even though offline metrics on the holdout set remained high.

Current Performance

Metric	Offline Validation	Production (last 14 days)	Change
AUC-ROC	0.91	0.74	-0.17
Precision @ serving threshold 0.60	0.84	0.58	-0.26
Recall @ serving threshold 0.60	0.76	0.41	-0.35
F1 Score	0.80	0.48	-0.32
Log Loss	0.29	0.67	+0.38
Predicted positive rate	31%	18%	-13 pts
Suggestion acceptance rate	27%	16%	-11 pts
User-reported “helpful” rate	72%	54%	-18 pts
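
Before interpreting the table, it can help to check that the rows are mutually consistent. The sketch below recomputes F1 from the precision and recall rows and derives the share of positives that those rows imply, using the identity precision × predicted-positive rate = recall × actual-positive rate. It assumes the classifier's positive class is "accepted" and that every row was computed over the same set of suggestions; neither is stated in the prompt, and the function names are illustrative.

```python
# Values copied from the table above (production = last 14 days).
METRICS = {
    "offline": {"precision": 0.84, "recall": 0.76, "f1": 0.80,
                "predicted_positive_rate": 0.31, "acceptance_rate": 0.27},
    "production": {"precision": 0.58, "recall": 0.41, "f1": 0.48,
                   "predicted_positive_rate": 0.18, "acceptance_rate": 0.16},
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_positive_share(precision, recall, predicted_positive_rate):
    """Share of actual positives implied by the other three quantities.

    True-positive share = precision * predicted-positive share
                        = recall * actual-positive share
    """
    return precision * predicted_positive_rate / recall


for name, m in METRICS.items():
    f1 = f1_from(m["precision"], m["recall"])
    share = implied_positive_share(
        m["precision"], m["recall"], m["predicted_positive_rate"]
    )
    print(f"{name}: recomputed F1={f1:.2f}, table F1={m['f1']:.2f}")
    print(f"{name}: implied positive share={share:.1%}, "
          f"reported acceptance rate={m['acceptance_rate']:.0%}")
```

If the implied positive share and the reported acceptance rate disagree, one possible reading is that the classification metrics and the acceptance rate were measured on different populations or with different label definitions, which is itself a lead worth following when label mismatch is on the list of suspects.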

The Problem

You need to explain why the model scores well on offline validation but underperforms in production, and determine which of the following is responsible: data drift, label mismatch, thresholding, calibration, serving bugs, or evaluation leakage.
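
One way to test the data-drift hypothesis is to compare each feature's distribution in the training data against the same feature as logged at serving time. The sketch below uses the Population Stability Index (PSI) with equal-width bins; it is a minimal standard-library version, not a description of GitLab's tooling, and the binning scheme and `eps` floor are assumptions chosen for simplicity.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of the reference sample; live
    values outside that range are clamped into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        # Floor each share at eps so the log ratio is always defined.
        return [max(c / total, eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic illustration only: a reference sample and a shifted copy.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.3 for x in reference]
print(f"PSI, identical samples: {psi(reference, reference):.3f}")
print(f"PSI, shifted sample:    {psi(reference, shifted):.3f}")
```

Running this per feature and sorting by PSI gives a shortlist of inputs to inspect. A feature whose serving-time distribution looks nothing like training can indicate genuine drift, but it can equally indicate a serving bug such as a missing default or a unit mismatch, so a high value narrows the search rather than settling it.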

Requirements

Interpret the metric pattern and identify the most likely failure modes.
Propose a step-by-step evaluation plan to isolate whether the gap comes from data, labels, model scoring, or serving.
Recommend how to validate your hypotheses using GitLab production logs and delayed labels.
Suggest concrete model or evaluation changes to recover online performance.
Explain what you would monitor going forward and which metrics should trigger rollback.
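
The calibration part of the requirements above can be checked by bucketing production scores and comparing the mean score in each bucket with the observed acceptance rate. The sketch below computes an expected calibration error (ECE) this way. It assumes scores lie in [0, 1] and labels are 0/1; the function name and the bin count are illustrative choices.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted mean gap between average score and observed positive rate.

    Scores are assumed to lie in [0, 1]; labels are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))

    total = len(scores)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(s for s, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_score - positive_rate)
    return ece


# Synthetic illustration only: ten suggestions all scored 0.9.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [0] * 10)
print(f"ECE when 9 of 10 are accepted: {well_calibrated:.2f}")
print(f"ECE when none are accepted:    {overconfident:.2f}")
```

Comparing ECE on the holdout set with ECE on matured production labels helps separate two cases. If ranking quality (AUC) holds up but ECE grows, recalibration or a new threshold may be enough; if both degrade, as the table suggests here, the problem more likely sits in the features, the labels, or the serving path.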

Constraints

Acceptance labels arrive with a 7-day delay.
The model serves under a 120 ms p95 latency budget.
Product wants to avoid a large increase in low-quality GitLab Duo suggestions.
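
Because acceptance labels arrive with a 7-day delay, any production metric that includes suggestions served in the last 7 days risks counting "no label yet" as "rejected". The sketch below joins logged predictions to labels only once the label window has closed. The record fields (`suggestion_id`, `served_at`, `score`) and the choice to treat a matured suggestion with no label as not accepted are assumptions for illustration, not details from the prompt.

```python
from datetime import datetime, timedelta, timezone

LABEL_DELAY = timedelta(days=7)  # taken from the constraint above


def join_matured_labels(predictions, accepted_ids, now, delay=LABEL_DELAY):
    """Attach labels only to predictions whose label window has closed.

    predictions:  list of dicts with hypothetical keys
                  'suggestion_id', 'served_at', 'score'.
    accepted_ids: set of suggestion ids that were accepted.
    """
    cutoff = now - delay
    rows = []
    for p in predictions:
        if p["served_at"] > cutoff:
            continue  # label may still arrive; exclude rather than guess
        # Assumption: once the window has closed, no label means not accepted.
        rows.append({**p, "accepted": p["suggestion_id"] in accepted_ids})
    return rows


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
logged = [
    {"suggestion_id": "a", "served_at": now - timedelta(days=10), "score": 0.8},
    {"suggestion_id": "b", "served_at": now - timedelta(days=9), "score": 0.7},
    {"suggestion_id": "c", "served_at": now - timedelta(days=2), "score": 0.9},
]
evaluable = join_matured_labels(logged, accepted_ids={"a"}, now=now)
print([(r["suggestion_id"], r["accepted"]) for r in evaluable])
```

If the "last 14 days" production column was computed over the full window, roughly half of it could fall inside the open label period, which would depress recall and acceptance independently of model quality. Recomputing the table on matured rows only is a cheap first check.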
