NorthRiver Health deployed a gradient-boosting model to predict inpatient sepsis risk within the next 12 hours and trigger rapid-response review. After a six-week pilot across three hospitals, leadership sees mixed results: the model improved early detection, but clinicians report excessive alert volume and uneven performance across units.
| Metric | Validation Set | Pilot Production | Target |
|---|---|---|---|
| AUC-ROC | 0.89 | 0.84 | >= 0.85 |
| Precision | 0.41 | 0.28 | >= 0.35 |
| Recall | 0.76 | 0.71 | >= 0.75 |
| F1 Score | 0.53 | 0.40 | >= 0.50 |
| Alert rate | 9.8% | 14.2% | <= 10% |
| Calibration slope | 0.97 | 0.78 | 0.95-1.05 |
| Median lead time before diagnosis | 5.6 hrs | 4.1 hrs | >= 4 hrs |
| ICU unit recall | 0.81 | 0.79 | >= 0.75 |
| General ward recall | 0.74 | 0.63 | >= 0.72 |
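The table can be summarized mechanically by checking each pilot value against its target band. The sketch below is illustrative scoring logic, not NorthRiver's evaluation code; the values are copied from the table above, with one-sided targets expressed as an open bound.

```python
def check_targets(metrics):
    """Map each metric name to True/False: does the value meet its target band?

    metrics: {name: (value, lower_bound_or_None, upper_bound_or_None)}
    """
    return {
        name: (lo is None or value >= lo) and (hi is None or value <= hi)
        for name, (value, lo, hi) in metrics.items()
    }

# Pilot-production values and targets from the table above.
pilot = {
    "AUC-ROC":           (0.84,  0.85, None),
    "Precision":         (0.28,  0.35, None),
    "Recall":            (0.71,  0.75, None),
    "F1":                (0.40,  0.50, None),
    "Alert rate":        (0.142, None, 0.10),
    "Calibration slope": (0.78,  0.95, 1.05),
    "Lead time (hrs)":   (4.1,   4.00, None),
    "ICU recall":        (0.79,  0.75, None),
    "Ward recall":       (0.63,  0.72, None),
}

failed = [name for name, ok in check_targets(pilot).items() if not ok]
print(failed)
# Only lead time and ICU recall meet their targets; the other seven metrics miss.
```

Seven of nine production metrics miss their targets, which frames the success question: the two that pass (lead time, ICU recall) are the most directly clinical ones.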
During the pilot, 18,400 admissions were scored. Sepsis prevalence was 6.5% (1,196 cases). At the current threshold, the model generated 2,613 alerts, including 732 true positives and 1,881 false positives.
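A quick arithmetic check of these counts (values from the text; rounding conventions assumed) confirms the reported precision and alert rate, and surfaces one discrepancy worth reconciling:

```python
# Pilot counts as reported in the text.
admissions = 18_400
cases      = 1_196   # 6.5% prevalence
alerts     = 2_613
true_pos   = 732
false_pos  = 1_881

assert true_pos + false_pos == alerts  # counts are internally consistent

precision = true_pos / alerts            # ~0.280, matches the table's 0.28
alert_rate = alerts / admissions         # ~0.142, matches the 14.2% alert rate
recall_from_counts = true_pos / cases    # ~0.612, BELOW the table's 0.71 --
                                         # possibly alert-level vs case-level
                                         # counting; worth reconciling
print(round(precision, 3), round(alert_rate, 3), round(recall_from_counts, 3))
```

The recall implied by these counts (~0.61) does not match the reported 0.71; before judging the project, it would be worth asking how true positives were attributed (per alert, per admission, or per sepsis case).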
You need to assess whether this AI project should be considered successful in a healthcare setting, where clinical benefit, alert burden, calibration, and patient safety all matter, not just discrimination metrics such as AUC-ROC.