Evaluate Claims Triage Classifier

Context

BCG Digital Ventures has deployed a binary classifier in a claims triage workflow to identify insurance claims that should be escalated for manual fraud review. The current model is a gradient-boosted tree classifier used for production scoring, but fraud operations reports that too many suspicious claims are still being approved automatically.

Current Performance

Metric	Current Model (Validation Set)	Prior Model	Change
Precision	0.84	0.76	+0.08
Recall	0.58	0.71	-0.13
F1-score	0.69	0.73	-0.04
ROC-AUC	0.87	0.84	+0.03
Review rate	6.2%	9.8%	-3.6 pts
Fraud prevalence	4.0%	4.0%	0.0 pts

Confusion Matrix at Current Threshold (0.70)

On a validation sample of 50,000 claims:

	Predicted Fraud	Predicted Non-Fraud
Actual Fraud	1,160	840
Actual Non-Fraud	221	47,779
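
As a quick consistency check, the headline metrics can be recomputed from the four cells of the matrix above. The sketch below is illustrative only; the function name confusion_metrics and the dictionary keys are hypothetical, not part of the original problem.

```python
# Counts copied from the confusion matrix at threshold 0.70.
TP, FN = 1160, 840      # actual fraud: caught / missed
FP, TN = 221, 47779     # actual non-fraud: flagged / passed


def confusion_metrics(tp, fn, fp, tn):
    """Derive standard metrics from raw confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)          # share of flagged claims that are fraud
    recall = tp / (tp + fn)             # share of fraud that gets flagged
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn),
        "prevalence": (tp + fn) / total,
        "flagged_share": (tp + fp) / total,
    }


metrics = confusion_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision (about 0.84), recall (0.58), F1 (about 0.69) and prevalence (4.0%) agree with the table. At roughly 2.8%, the flagged share of this sample is lower than the 6.2% review rate in the table, so the two are presumably defined over different populations; that is worth confirming before comparing them.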

The Problem

Leadership wants to know whether this model is actually better than the prior version and whether the operating threshold is appropriate. The metrics appear mixed: precision and ROC-AUC improved, but recall and F1-score declined materially.

Requirements

Interpret precision, recall, F1-score, and ROC-AUC using the numbers above.
Explain what the confusion matrix says about model behavior at the current threshold.
Diagnose why ROC-AUC can improve while recall worsens.
Recommend whether to keep, retune, or replace the model.
Propose how you would evaluate threshold changes under business constraints.
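
One way to approach the ROC-AUC versus recall question is with a toy example. ROC-AUC depends only on how the model ranks fraud above non-fraud, while recall depends on where the threshold cuts the score distribution. The scores below are invented purely for illustration (they are not drawn from the validation set), and the helper names roc_auc and recall_at are hypothetical.

```python
# Toy labels: 1 = fraud, 0 = legitimate. All scores are made up for illustration.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
old_scores = [0.90, 0.80, 0.75, 0.40, 0.85, 0.72, 0.50, 0.30, 0.20, 0.10]
new_scores = [0.80, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.30, 0.20, 0.10]


def roc_auc(labels, scores):
    """AUC as the share of (fraud, non-fraud) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(labels, scores, threshold):
    """Share of fraud cases scoring at or above the threshold."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)


for name, scores in [("old", old_scores), ("new", new_scores)]:
    print(name, "AUC:", round(roc_auc(labels, scores), 3),
          "recall@0.70:", recall_at(labels, scores, 0.70))
print("new recall@0.55:", recall_at(labels, new_scores, 0.55))
```

In this toy data the new scores separate the classes perfectly, yet most fraud cases score below 0.70, so recall at the fixed threshold drops. Lowering the threshold recovers it. Whether the production model behaves the same way would need to be checked against its actual score distribution.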

Constraints

Manual review team can handle at most 4,000 claims per day.
Missing a fraudulent claim costs about $1,200 on average.
Reviewing a legitimate claim costs about $18 in analyst time and customer friction.
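
The two cost figures above can be combined with the confusion matrix to put a rough price on the errors at the current threshold. This is a minimal sketch under simplifying assumptions that the problem does not state: a reviewed fraudulent claim is always caught, reviewing it adds no cost, and the model score is a calibrated probability. The names error_cost and break_even_probability are hypothetical.

```python
COST_MISSED_FRAUD = 1200   # dollars per fraudulent claim approved automatically
COST_FALSE_REVIEW = 18     # dollars per legitimate claim sent to manual review

# Error counts from the confusion matrix at threshold 0.70 (50,000-claim sample).
FALSE_NEGATIVES = 840
FALSE_POSITIVES = 221


def error_cost(false_negatives, false_positives):
    """Total cost of both error types, in dollars."""
    return (false_negatives * COST_MISSED_FRAUD
            + false_positives * COST_FALSE_REVIEW)


current_cost = error_cost(FALSE_NEGATIVES, FALSE_POSITIVES)

# Assumption: with calibrated scores, a review pays off when
# p * COST_MISSED_FRAUD > (1 - p) * COST_FALSE_REVIEW.
break_even_probability = COST_FALSE_REVIEW / (COST_FALSE_REVIEW + COST_MISSED_FRAUD)

print(f"missed-fraud cost: ${FALSE_NEGATIVES * COST_MISSED_FRAUD:,}")
print(f"false-review cost: ${FALSE_POSITIVES * COST_FALSE_REVIEW:,}")
print(f"total error cost:  ${current_cost:,}")
print(f"break-even fraud probability: {break_even_probability:.4f}")
```

On this sample, missed fraud accounts for $1,008,000 of the error cost against $3,978 for unnecessary reviews, which suggests the 0.70 threshold is far more conservative than the cost ratio alone would justify. The 4,000-claims-per-day review capacity still bounds how low the threshold can go, and checking that requires the daily claim volume, which is not given here.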

