You’re the on-call ML scientist for StreamShare, a short-form video platform with 45M DAUs and a heavy teen user base. StreamShare uses a binary classifier to detect and block high-severity policy violations at upload time (e.g., self-harm encouragement, explicit sexual content involving minors, and credible threats). The model is a transformer-based text+vision ensemble that outputs a probability score p(violation); thresholds on that score then decide whether to auto-block, send to human review, or allow.
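A three-way decision like this is typically implemented with two cutoffs on p(violation). A minimal sketch, with illustrative threshold values (not StreamShare's actual production settings):

```python
def route(p_violation, t_block=0.90, t_review=0.50):
    """Route an upload based on the model's violation probability.

    t_block and t_review are placeholder values for illustration,
    not the real production thresholds.
    """
    if p_violation >= t_block:
        return "auto_block"      # high confidence: block without review
    if p_violation >= t_review:
        return "human_review"    # uncertain: queue for Safety Ops
    return "allow"               # low risk: publish

print(route(0.95))  # auto_block
print(route(0.60))  # human_review
print(route(0.10))  # allow
```

Note that moving either cutoff trades precision/recall against review volume, which is why the review-queue metrics below matter as much as the classifier metrics.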
The company has a strict internal goal: keep “severe harm exposure” below a fixed budget because of regulatory scrutiny (EU DSA / UK OSA) and advertiser requirements. The policy team defines a “severe harm incident” as: a violating piece of content that is viewed by at least one user before it is removed or age-gated. The Safety Ops team can review up to 120,000 items/day globally.
Three weeks ago, you shipped a new model (Model B) that looked better offline than the previous production model (Model A). However, Trust & Safety reports that user-reported severe incidents increased week-over-week after the launch, even though the model’s offline metrics improved.
Holdout dataset: 10M labeled items from the last 60 days (labels from a mix of human review and post-hoc enforcement). Severe violations are rare.
| Metric (offline) | Model A (prod) | Model B (new) |
|---|---|---|
| AUC-ROC | 0.962 | 0.975 |
| Precision @ current threshold | 0.41 | 0.37 |
| Recall @ current threshold | 0.78 | 0.84 |
| F1 @ current threshold | 0.54 | 0.52 |
| Calibration (ECE) | 0.061 | 0.094 |
| % items sent to review | 1.10% | 1.35% |
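The calibration row above (ECE) measures how far predicted probabilities drift from empirical violation rates; Model B's worse ECE (0.094 vs. 0.061) means its scores are less trustworthy as probabilities even though its ranking (AUC) improved. A sketch of the standard binned ECE computation:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean |empirical rate - mean predicted prob| per bin.

    probs:  predicted p(violation) in [0, 1]
    labels: 0/1 ground truth
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include 1.0 in the last bin
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            conf = probs[mask].mean()   # mean predicted probability in bin
            acc = labels[mask].mean()   # empirical violation rate in bin
            ece += mask.mean() * np.abs(acc - conf)
    return float(ece)

# Perfectly calibrated bin: 10% predicted, 10% observed -> ECE 0
print(expected_calibration_error([0.1] * 10, [1] + [0] * 9))  # 0.0
```

With severe violations being rare, a per-bin reliability check on the high-score region is usually more informative than the single aggregate number.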
Traffic is split 50/50 by uploader (A/B). Policy and enforcement rules are unchanged.
| Metric (online) | Model A | Model B |
|---|---|---|
| Severe harm incidents per 10K uploads | 1.8 | 2.4 |
| User reports per 10K uploads | 14.2 | 16.1 |
| Median time-to-action (minutes) | 18 | 26 |
| Review queue overflow rate | 2% | 11% |
| Creator appeal rate (blocked content) | 3.1% | 4.8% |
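One back-of-envelope check connects the offline review-rate change to the online overflow: at a sufficiently large upload volume, Model B's 1.35% review rate exceeds the 120,000 items/day global budget where Model A's 1.10% did not, which would explain the overflow rate and the slower time-to-action. The upload volume below is hypothetical (the actual figure isn't given in the brief):

```python
REVIEW_CAPACITY_PER_DAY = 120_000   # Safety Ops global budget (from the brief)
UPLOADS_PER_DAY = 10_000_000        # hypothetical volume, for illustration only

def review_load(review_rate, uploads=UPLOADS_PER_DAY):
    """Expected daily items routed to human review at a given review rate."""
    return review_rate * uploads

load_a = review_load(0.0110)  # Model A's offline review rate at full rollout
load_b = review_load(0.0135)  # Model B's offline review rate at full rollout

print(load_a, load_a <= REVIEW_CAPACITY_PER_DAY)  # 110000.0 True
print(load_b, load_b <= REVIEW_CAPACITY_PER_DAY)  # 135000.0 False
```

Under this assumption, Model B saturates the queue: items wait longer before action, and a violating item that is viewed even once while waiting counts as a severe harm incident under the policy team's definition.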
Despite better offline AUC and higher offline recall, Model B is associated with more severe harm exposure online, slower enforcement, and review-capacity overflow. Leadership asks you two things:
Provide a structured answer that covers: