Diagnose Offline-Online Performance Gap

Difficulty: Hard
Topic: Model Evaluation
Tags: AUC-ROC, Calibration, Threshold Tuning
Asked at 18 companies, including NVIDIA, Rolls-Royce, OpenAI, Ancestry, Google

Problem

Context

GitLab deployed a gradient boosted classifier in GitLab Duo to predict whether an AI-generated code suggestion will be accepted by a developer. Offline validation looked strong, so the model replaced a heuristic ranker in production. After launch, online acceptance and user satisfaction dropped, despite offline metrics remaining high on the holdout set.

Current Performance

Metric                              Offline Validation  Production (last 14 days)  Change
----------------------------------  ------------------  -------------------------  -------
AUC-ROC                             0.91                0.74                       -0.17
Precision @ serving threshold 0.60  0.84                0.58                       -0.26
Recall @ serving threshold 0.60     0.76                0.41                       -0.35
F1 Score                            0.80                0.48                       -0.32
Log Loss                            0.29                0.67                       +0.38
Predicted positive rate             31%                 18%                        -13 pts
Suggestion acceptance rate          27%                 16%                        -11 pts
User-reported "helpful" rate        72%                 54%                        -18 pts
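
Before reading anything into the table, it can help to check that the reported numbers agree with each other. The sketch below recomputes F1 from precision and recall, then uses the identity predicted positive rate × precision = base rate × recall (both sides equal the true-positive share) to back out the positive-label rate that the other three figures imply. It assumes all metrics were computed on the same population and label definition, which the source does not state; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_base_rate(predicted_positive_rate: float,
                      precision: float, recall: float) -> float:
    """Positive-label rate implied by PPR * precision = base_rate * recall."""
    return predicted_positive_rate * precision / recall


# Values copied from the Current Performance table.
offline = {"precision": 0.84, "recall": 0.76, "ppr": 0.31, "acceptance": 0.27}
production = {"precision": 0.58, "recall": 0.41, "ppr": 0.18, "acceptance": 0.16}

for name, m in (("offline", offline), ("production", production)):
    f1 = f1_score(m["precision"], m["recall"])
    base = implied_base_rate(m["ppr"], m["precision"], m["recall"])
    print(f"{name}: F1={f1:.2f}  implied base rate={base:.3f}  "
          f"reported acceptance={m['acceptance']:.2f}")
```

The F1 values match the table. The implied base rates (about 0.34 offline and 0.25 in production) are higher than the reported acceptance rates, which may mean the evaluation labels and the acceptance-rate metric are defined on different populations. That makes label mismatch worth checking early.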

The Problem

You need to explain why the model scores well on the offline holdout set but underperforms in production, and determine whether the gap is caused by data drift, label mismatch, thresholding, calibration, serving bugs, or evaluation leakage.
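
Data drift, one of the candidate causes named above, can be screened feature by feature with a population stability index (PSI) that compares a training sample against logged production values. This is a minimal pure-Python sketch; the binning scheme, the `eps` floor, and the sample data are assumptions for illustration, and it is not a description of any GitLab tooling.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two samples of one numeric feature.

    Bins are equal-width over the range of `expected` (the training sample);
    production values outside that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training sample vs. a shifted production sample.
train_values = [i / 100 for i in range(100)]
prod_values = [0.5 + i / 200 for i in range(100)]
print(f"PSI = {psi(train_values, prod_values):.2f}")
```

A common rule of thumb treats PSI above roughly 0.25 as a large shift, but that cutoff is a convention rather than a guarantee. Features that cross it are candidates for a closer look, not proof of the root cause.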

Requirements

  1. Interpret the metric pattern and identify the most likely failure modes.
  2. Propose a step-by-step evaluation plan to isolate whether the gap comes from data, labels, model scoring, or serving.
  3. Recommend how to validate your hypotheses using GitLab production logs and delayed labels.
  4. Suggest concrete model or evaluation changes to recover online performance.
  5. Explain what you would monitor going forward and which metrics should trigger rollback.
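
For requirement 4, one concrete change is to re-tune the serving threshold on production data whose labels have arrived, rather than reusing the offline value of 0.60. The sketch below picks the lowest threshold that still meets a target precision, which maximizes recall subject to that floor. The target value and the toy scores are hypothetical, and this only helps if the scores still rank suggestions reasonably well.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Lowest score threshold whose precision meets the target.

    A suggestion is predicted positive when score >= threshold.
    Returns None if no threshold reaches the target precision.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best = None
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all items tied at this score before evaluating precision.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= target_precision:
            best = score
    return best


# Toy example: model scores and accepted (1) / rejected (0) labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(threshold_for_precision(scores, labels, target_precision=0.75))
```

Lowering the threshold raises recall at the cost of precision, so the target should reflect how many low-quality suggestions the product is willing to show.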

Constraints

  • Acceptance labels arrive with a 7-day delay.
  • The model serves under a 120 ms p95 latency budget.
  • Product wants to avoid a large increase in low-quality GitLab Duo suggestions.
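
Because acceptance labels arrive seven days late, any production metric or rollback rule should be computed only on suggestions old enough to have a label. The following sketch filters to matured events and applies a simple precision floor at the serving threshold. The event schema (`served_at`, `score`, `accepted`) and the 0.75 floor are hypothetical choices for illustration, not values from the source.

```python
from datetime import datetime, timedelta

LABEL_DELAY = timedelta(days=7)  # from the constraints above
SERVING_THRESHOLD = 0.60         # from the metrics table
MIN_PRECISION = 0.75             # hypothetical rollback floor


def matured(events, now):
    """Keep only events served long enough ago for their label to have arrived."""
    cutoff = now - LABEL_DELAY
    return [e for e in events if e["served_at"] <= cutoff]


def precision_at_threshold(events, threshold=SERVING_THRESHOLD):
    """Share of accepted suggestions among those scored at or above the threshold."""
    flagged = [e for e in events if e["score"] >= threshold]
    if not flagged:
        return None  # nothing was predicted positive, so precision is undefined
    return sum(1 for e in flagged if e["accepted"]) / len(flagged)


def should_roll_back(events, now, min_precision=MIN_PRECISION):
    """True when precision on matured events falls below the floor."""
    precision = precision_at_threshold(matured(events, now))
    return precision is not None and precision < min_precision


now = datetime(2024, 1, 15)
events = [
    {"served_at": datetime(2024, 1, 1), "score": 0.90, "accepted": True},
    {"served_at": datetime(2024, 1, 2), "score": 0.70, "accepted": False},
    {"served_at": datetime(2024, 1, 14), "score": 0.95, "accepted": False},  # label not yet final
]
print(should_roll_back(events, now))
```

A label-based trigger like this lags by at least a week, so in practice it would likely be paired with faster signals that need no labels, such as score-distribution shift or the predicted positive rate.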
