Evaluate Operational Impact of Classifier

Context

ShopFlow uses a binary classifier to predict whether a customer support ticket should be escalated to the specialist operations team. A new gradient-boosted model was run against the logistic regression baseline in a shadow test, but leadership cares about operational improvement, not just offline lift. Specialist capacity is fixed, and missed escalations create SLA breaches and customer churn risk.

Current Performance

Metric	Baseline Model	New Model	Change
Accuracy	0.842	0.861	+0.019
Precision	0.610	0.540	-0.070
Recall	0.420	0.680	+0.260
F1 Score	0.497	0.602	+0.105
AUC-ROC	0.781	0.844	+0.063
Log Loss	0.412	0.356	-0.056
Daily escalations predicted	820	1,480	+660
Daily true escalation need	1,200	1,200	0
Avg SLA breaches/day	696	384	-312
Specialist review capacity/day	1,300	1,300	0
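
The aggregate rows above can be turned into approximate daily confusion counts. This is a minimal sketch, assuming true positives equal recall times the daily true escalation need; the helper name `confusion_counts` is hypothetical and not part of any ShopFlow tooling. Under that assumption the false-negative counts match the SLA breach row (696 and 384), while the implied precision differs slightly from the reported values, presumably because of rounding or daily averaging.

```python
def confusion_counts(predicted_pos, actual_pos, recall):
    """Approximate daily TP/FP/FN from aggregate table rows.

    Assumption: TP = recall * actual positives, rounded to whole tickets.
    """
    tp = round(recall * actual_pos)
    fn = actual_pos - tp          # truly urgent tickets the model missed
    fp = predicted_pos - tp       # escalations that were not truly needed
    return {"tp": tp, "fp": fp, "fn": fn}


# Inputs taken from the table: predicted escalations, true need, recall.
baseline = confusion_counts(predicted_pos=820, actual_pos=1200, recall=0.420)
new = confusion_counts(predicted_pos=1480, actual_pos=1200, recall=0.680)

# Precision implied by these counts, for comparison with the reported values.
implied_precision_baseline = baseline["tp"] / 820
implied_precision_new = new["tp"] / 1480
```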

The Problem

The new classifier finds more truly urgent tickets, but it also sends many more tickets to specialists and exceeds daily review capacity. You need to determine whether the model is actually improving operations or simply shifting work downstream.
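
The capacity gap can be made explicit with a small calculation. The sketch below uses only the predicted-escalation and capacity rows from the table; `capacity_load` is a hypothetical helper name introduced for illustration.

```python
def capacity_load(predicted_per_day, capacity_per_day):
    """Return (overflow tickets, utilization ratio) for one day."""
    overflow = max(0, predicted_per_day - capacity_per_day)
    utilization = predicted_per_day / capacity_per_day
    return overflow, utilization


baseline_overflow, baseline_util = capacity_load(820, 1300)
new_overflow, new_util = capacity_load(1480, 1300)
# The new model sends 180 more tickets per day than specialists can review,
# while the baseline leaves roughly a third of capacity unused.
```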

Requirements

Determine whether the new model is better for operations, not just on offline metrics.
Quantify the trade-off between false negatives and false positives using the data provided.
Assess whether the current decision threshold is appropriate given capacity limits.
Identify likely failure modes and what additional analysis you would run before launch.
Recommend concrete changes to model usage, evaluation, or thresholding.
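
One way to approach the thresholding requirement is to choose the cutoff from the score distribution so that flagged volume never exceeds capacity. The following is a sketch under stated assumptions: per-ticket scores for a day are available as a list, a ticket is flagged when its score is at least the threshold, and `capacity_threshold` is a hypothetical helper. It is not a description of how ShopFlow currently sets its threshold.

```python
import math


def capacity_threshold(scores, capacity):
    """Smallest threshold t such that count(score >= t) <= capacity.

    Ties at the boundary are resolved conservatively: if flagging a tied
    group would exceed capacity, the whole group is left unflagged.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= capacity:
        # Everything fits; flag all tickets (or nothing if there are none).
        return min(ranked) if ranked else math.inf
    cutoff = ranked[capacity]  # highest score that must NOT be flagged
    higher = [s for s in ranked[:capacity] if s > cutoff]
    return min(higher) if higher else math.inf


# Illustrative scores only, not ShopFlow data.
example_scores = [0.9, 0.8, 0.8, 0.4, 0.1]
t = capacity_threshold(example_scores, capacity=3)
flagged = [s for s in example_scores if s >= t]
```

In practice a fixed threshold derived from historical volume would need monitoring, since daily ticket counts and score distributions can drift.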

Constraints

Specialist team can handle at most 1,300 escalations per day.
A false negative causes an average of $42 in downstream cost.
A false positive costs $6 in specialist handling time.
Ticket labels are finalized with a 7-day delay.
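
The cost constraints allow a rough daily error-cost comparison. This is a sketch, not a definitive analysis. It assumes true positives of 504 (baseline) and 816 (new model), derived as recall times the daily need of 1,200, and it assumes that when predictions exceed capacity the overflow tickets are dropped at random, so reviewed true positives shrink proportionally. A ranked review queue would likely do better than this random-drop assumption. The function and variable names are hypothetical.

```python
COST_FN = 42.0     # downstream cost per missed escalation (from constraints)
COST_FP = 6.0      # specialist handling cost per unnecessary escalation
DAILY_NEED = 1200  # daily true escalation need (from the table)
CAPACITY = 1300    # specialist review capacity per day


def daily_cost(tp, predicted, capacity=None):
    """Expected daily error cost in dollars.

    Assumption: if predicted volume exceeds capacity, overflow tickets are
    dropped uniformly at random, so true positives scale proportionally.
    """
    reviewed = predicted if capacity is None else min(predicted, capacity)
    tp_reviewed = tp * reviewed / predicted
    fn = DAILY_NEED - tp_reviewed   # urgent tickets never reviewed
    fp = reviewed - tp_reviewed     # reviewed tickets that were not urgent
    return COST_FN * fn + COST_FP * fp


baseline_cost = daily_cost(tp=504, predicted=820)
new_cost_uncapped = daily_cost(tp=816, predicted=1480)
new_cost_capped = daily_cost(tp=816, predicted=1480, capacity=CAPACITY)

# Ignoring capacity and assuming calibrated probabilities, escalation is
# cost-justified when p * COST_FN > (1 - p) * COST_FP.
break_even_probability = COST_FP / (COST_FP + COST_FN)
```

Under these assumptions the new model's error cost stays below the baseline's even after the capacity cap, but the gap narrows, which is why the threshold and queue ordering matter. Because labels arrive with a 7-day delay, any such cost estimate for live traffic would lag by at least a week.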
