You are the lead ML scientist at MedSure Diagnostics, a healthcare SaaS vendor that provides decision-support software to hospital networks in the US and EU. Your product ingests routine lab panels (CBC, CMP), vitals, and EHR-derived features to predict sepsis risk within 6 hours and recommend whether a patient should be escalated to a rapid-response team.
The system is being prepared for deployment in a GxP environment (customers treat it as software impacting patient safety and regulated quality systems). The model is a gradient-boosted tree classifier trained on retrospective data from 18 hospitals (approx. 2.4M ED encounters over 24 months). Sepsis prevalence is ~1.2%. Labels are derived from a clinical adjudication pipeline finalized 7 days post-encounter. The model outputs a probability score and triggers an alert when the score exceeds a threshold.
Validation was run on a temporally held-out set (last 3 months) from 6 hospitals not used in training. The clinical team is concerned because, despite strong discrimination, they observed “too many missed cases” and inconsistent alert rates across sites.
| Metric (overall) | Value | Notes |
|---|---|---|
| AUC-ROC | 0.93 | Strong ranking performance |
| Precision (PPV) @ threshold=0.20 | 0.11 | Many alerts are not sepsis |
| Recall (Sensitivity) @ threshold=0.20 | 0.68 | ~32% of sepsis missed |
| Specificity @ threshold=0.20 | 0.92 | Low false positive rate in absolute terms |
| Brier score | 0.041 | Suggests miscalibration given low prevalence |
| ECE (Expected Calibration Error) | 0.074 | Over-confident probabilities in 0.3–0.6 range |
| Alert rate | 8.0% | Varies 3%–14% by hospital |
| Hospital slice | Prevalence | Recall | Precision | Alert rate |
|---|---|---|---|---|
| Site A (urban academic) | 1.6% | 0.74 | 0.10 | 12% |
| Site B (community) | 0.9% | 0.61 | 0.13 | 5% |
| Site C (EU) | 1.1% | 0.66 | 0.08 | 14% |
MedSure must produce a GxP-grade validation package suitable for customer audits and internal quality review before go-live. The VP of Clinical Safety is asking for a concrete plan that demonstrates the model is fit for intended use, robust to site differences, and controlled throughout its lifecycle.
Propose a complete validation strategy appropriate for a GxP environment. Your answer should cover: