You’re the on-call ML scientist for NorthBridge Health, a 12-hospital network (≈3.5M ED + inpatient encounters/year). The organization deployed a sepsis early warning model into the EHR to help clinicians identify patients at risk of developing sepsis within the next 12 hours. The model runs every 30 minutes for admitted patients and produces a risk score (0–1). If the score exceeds a threshold, the EHR triggers an alert to the bedside nurse and the covering physician.
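At its core, that workflow is a thresholded decision applied to each scheduled score. A minimal sketch, assuming a placeholder threshold and hypothetical function and field names (none of these reflect the deployed configuration):

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 0.30  # placeholder operating point, not the deployed value


@dataclass
class RiskScore:
    patient_id: str
    scored_at: str   # timestamp of the 30-minute scoring run
    score: float     # model output in [0, 1]


def patients_to_alert(batch: list[RiskScore], threshold: float = ALERT_THRESHOLD) -> list[str]:
    """Return patients whose current score crosses the alert threshold.

    In production the EHR routes the alert to the bedside nurse and the
    covering physician; here we only collect the patient identifiers.
    """
    return [r.patient_id for r in batch if r.score >= threshold]
```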
The model was trained on 24 months of retrospective data (≈1.8M admissions). Labels were derived from a Sepsis-3 proxy: suspected infection plus organ dysfunction occurring within a defined time window. The model is a gradient-boosted tree ensemble using vitals, labs, demographics, comorbidities, and limited text-derived features (problem-list and triage-note keywords). Deployment began 10 weeks ago.
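A hedged sketch of how such a proxy label might be constructed from encounter timestamps. The column names and window lengths below are illustrative assumptions, not NorthBridge's actual label specification:

```python
import pandas as pd

# Illustrative windows only; not NorthBridge's actual label specification.
SUSPICION_WINDOW_H = 24    # antibiotics and culture order must co-occur within this window
DYSFUNCTION_WINDOW_H = 12  # organ dysfunction must follow suspicion within this window


def label_sepsis3_proxy(enc: pd.DataFrame) -> pd.Series:
    """Proxy label: suspected infection plus organ dysfunction within a time window.

    `enc` has one row per encounter with timestamp columns (NaT if the event
    never occurred): `abx_time`, `culture_time`, `organ_dysfunction_time`.
    These column names are hypothetical.
    """
    # Suspected infection: antibiotics and a culture ordered close together.
    abx_culture_gap = (enc["abx_time"] - enc["culture_time"]).abs()
    suspected_infection = abx_culture_gap <= pd.Timedelta(hours=SUSPICION_WINDOW_H)

    # Organ dysfunction appearing within the window after suspicion onset.
    suspicion_time = enc[["abx_time", "culture_time"]].min(axis=1)
    dysfunction_gap = enc["organ_dysfunction_time"] - suspicion_time
    dysfunction_in_window = dysfunction_gap.between(
        pd.Timedelta(0), pd.Timedelta(hours=DYSFUNCTION_WINDOW_H)
    )

    return (suspected_infection & dysfunction_in_window).astype(int)
```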
NorthBridge’s CMIO reports: “The model looks great in validation, but alerts don’t match what we see clinically. We’re getting too many alerts on stable patients, and we missed several patients who later required ICU transfer for sepsis.” Clinicians are losing trust, and the hospital is considering disabling alerts.
| Metric | Offline Test (held-out retrospective) | Live (prospective, adjudicated sample) |
|---|---|---|
| AUC-ROC | 0.91 | 0.86 |
| AUPRC | 0.34 | 0.18 |
| Sensitivity (Recall) @ current threshold | 0.78 | 0.52 |
| Precision (PPV) @ current threshold | 0.22 | 0.11 |
| Specificity @ current threshold | 0.93 | 0.90 |
| Brier score (calibration; lower is better) | 0.072 | 0.118 |
| % of patients alerted at least once | 7.5% | 12.8% |
| Median alert lead time (true positives only) | 5.1 hours | 2.0 hours |
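The live column is recomputed from the adjudicated prospective sample. A sketch of how the score-level metrics can be reproduced from paired labels and scores using scikit-learn; the `threshold` argument stands in for the deployed operating point, and the patient-level alert rate and lead-time rows would additionally require grouping the 30-minute scores by encounter (omitted here):

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)


def evaluation_summary(y_true: np.ndarray, y_score: np.ndarray, threshold: float) -> dict:
    """Recompute the table's score-level metrics from adjudicated labels and scores."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc_roc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "specificity": tn / (tn + fp),
        "brier": brier_score_loss(y_true, y_score),
    }
```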
The model’s predictions do not align with clinical outcomes: it alerts frequently on patients who never deteriorate and misses some patients who do. You need to diagnose whether the problem is primarily a metric/threshold issue, a label/definition mismatch, calibration drift, data leakage in the offline evaluation, distribution shift, or workflow/measurement artifacts.
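Two of those hypotheses, calibration drift and threshold choice, can be probed directly on the live adjudicated sample. A sketch using standard scikit-learn utilities (a reliability curve and a precision/recall threshold sweep); leakage and distribution shift would need separate checks against the training data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import precision_recall_curve


def calibration_and_threshold_report(y_true: np.ndarray, y_score: np.ndarray) -> None:
    """Two quick checks on the live adjudicated sample.

    1. Reliability curve: predicted probabilities sitting well above observed
       event rates per bin is consistent with the degraded live Brier score.
    2. Precision/recall vs. threshold: shows whether any threshold on the
       current scores recovers acceptable sensitivity at a workable PPV.
    """
    # 1. Calibration: mean predicted probability vs. observed event rate per bin.
    frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=10, strategy="quantile")
    for pred, obs in zip(mean_pred, frac_pos):
        print(f"predicted {pred:.3f} -> observed {obs:.3f}")

    # 2. Threshold sweep over the live scores (every 10th point for brevity).
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    for t, pr, rc in zip(thresholds[::10], precision[:-1][::10], recall[:-1][::10]):
        print(f"threshold {t:.2f}: precision {pr:.2f}, recall {rc:.2f}")
```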