You’re the on-call ML scientist for NorthBridge Health, a 12-hospital network (≈3.5M ED + inpatient encounters/year). The organization deployed a sepsis early warning model into the EHR to help clinicians identify patients at risk of developing sepsis within the next 12 hours. The model runs every 30 minutes for admitted patients and produces a risk score (0–1). If the score exceeds a threshold, the EHR triggers an alert to the bedside nurse and the covering physician.
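At its core, that workflow is a thresholded decision applied to each scheduled score. A minimal sketch, assuming a placeholder threshold and hypothetical function and field names (none of these reflect the deployed configuration):

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 0.30  # placeholder operating point, not the deployed value


@dataclass
class RiskScore:
    patient_id: str
    scored_at: str   # timestamp of the 30-minute scoring run
    score: float     # model output in [0, 1]


def patients_to_alert(batch: list[RiskScore], threshold: float = ALERT_THRESHOLD) -> list[str]:
    """Return patients whose current score crosses the alert threshold.

    In production the EHR routes the alert to the bedside nurse and the
    covering physician; here we only collect the patient identifiers.
    """
    return [r.patient_id for r in batch if r.score >= threshold]
```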
The model was trained on 24 months of retrospective data (≈1.8M admissions). Labels were derived from a Sepsis-3 proxy: suspected infection plus organ dysfunction occurring within a defined time window. The model is a gradient-boosted tree ensemble using vitals, labs, demographics, comorbidities, and limited text-derived features (problem-list and triage-note keywords). Deployment began 10 weeks ago.
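A hedged sketch of how such a proxy label might be constructed from encounter timestamps. The column names and window lengths below are illustrative assumptions, not NorthBridge's actual label specification:

```python
import pandas as pd

# Illustrative windows only; not NorthBridge's actual label specification.
SUSPICION_WINDOW_H = 24    # antibiotics and culture order must co-occur within this window
DYSFUNCTION_WINDOW_H = 12  # organ dysfunction must follow suspicion within this window


def label_sepsis3_proxy(enc: pd.DataFrame) -> pd.Series:
    """Proxy label: suspected infection plus organ dysfunction within a time window.

    `enc` has one row per encounter with timestamp columns (NaT if the event
    never occurred): `abx_time`, `culture_time`, `organ_dysfunction_time`.
    These column names are hypothetical.
    """
    # Suspected infection: antibiotics and a culture ordered close together.
    abx_culture_gap = (enc["abx_time"] - enc["culture_time"]).abs()
    suspected_infection = abx_culture_gap <= pd.Timedelta(hours=SUSPICION_WINDOW_H)

    # Organ dysfunction appearing within the window after suspicion onset.
    suspicion_time = enc[["abx_time", "culture_time"]].min(axis=1)
    dysfunction_gap = enc["organ_dysfunction_time"] - suspicion_time
    dysfunction_in_window = dysfunction_gap.between(
        pd.Timedelta(0), pd.Timedelta(hours=DYSFUNCTION_WINDOW_H)
    )

    return (suspected_infection & dysfunction_in_window).astype(int)
```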
NorthBridge’s CMIO reports: “The model looks great in validation, but alerts don’t match what we see clinically. We’re getting too many alerts on stable patients, and we missed several patients who later required ICU transfer for sepsis.” Clinicians are losing trust, and the hospital is considering disabling alerts.
| Metric | Offline Test (held-out retrospective) | Live (prospective, adjudicated sample) |
|---|---|---|
| AUC-ROC | 0.91 | 0.86 |
| AUPRC | 0.34 | 0.18 |
| Sensitivity (Recall) @ current threshold | 0.78 | 0.52 |
| Precision (PPV) @ current threshold | 0.22 | 0.11 |
| Specificity @ current threshold | 0.93 | 0.90 |
| Brier score (calibration; lower is better) | 0.072 | 0.118 |
| % of patients alerted at least once | 7.5% | 12.8% |
| Median alert lead time (true positives only) | 5.1 hours | 2.0 hours |
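The live column is recomputed from the adjudicated prospective sample. A sketch of how the score-level metrics can be reproduced from paired labels and scores using scikit-learn; the `threshold` argument stands in for the deployed operating point, and the patient-level alert rate and lead-time rows would additionally require grouping the 30-minute scores by encounter (omitted here):

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)


def evaluation_summary(y_true: np.ndarray, y_score: np.ndarray, threshold: float) -> dict:
    """Recompute the table's score-level metrics from adjudicated labels and scores."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc_roc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "specificity": tn / (tn + fp),
        "brier": brier_score_loss(y_true, y_score),
    }
```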
The model’s predictions do not align with clinical outcomes: it alerts frequently on patients who never deteriorate and misses some patients who do. You need to diagnose whether the problem is primarily a metric/threshold issue, a label/definition mismatch, calibration drift, data leakage in the offline evaluation, distribution shift, or workflow/measurement artifacts.
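Two of those hypotheses, calibration drift and threshold choice, can be probed directly on the live adjudicated sample. A sketch using standard scikit-learn utilities (a reliability curve and a precision/recall threshold sweep); leakage and distribution shift would need separate checks against the training data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import precision_recall_curve


def calibration_and_threshold_report(y_true: np.ndarray, y_score: np.ndarray) -> None:
    """Two quick checks on the live adjudicated sample.

    1. Reliability curve: predicted probabilities sitting well above observed
       event rates per bin is consistent with the degraded live Brier score.
    2. Precision/recall vs. threshold: shows whether any threshold on the
       current scores recovers acceptable sensitivity at a workable PPV.
    """
    # 1. Calibration: mean predicted probability vs. observed event rate per bin.
    frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=10, strategy="quantile")
    for pred, obs in zip(mean_pred, frac_pos):
        print(f"predicted {pred:.3f} -> observed {obs:.3f}")

    # 2. Threshold sweep over the live scores (every 10th point for brevity).
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    for t, pr, rc in zip(thresholds[::10], precision[:-1][::10], recall[:-1][::10]):
        print(f"threshold {t:.2f}: precision {pr:.2f}, recall {rc:.2f}")
```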