SureShield Insurance built a binary classification model to predict whether a newly submitted claim will become a high-cost fraudulent claim requiring special investigation. The dataset is highly imbalanced: only 1.8% of historical claims are labeled fraud. The team initially optimized for accuracy, but fraud losses remain high and investigators say too many risky claims are being missed.
| Metric | Validation Set | Notes |
|---|---|---|
| Accuracy | 0.972 | High due to class imbalance |
| Precision | 0.41 | 41% of flagged claims are actually fraud |
| Recall | 0.29 | Model catches less than one-third of fraud cases |
| F1 Score | 0.34 | Weak balance between precision and recall |
| AUC-ROC | 0.86 | Good ranking overall |
| Log Loss | 0.118 | Probabilities are moderately informative |
| Fraud rate | 1.8% | 1,800 fraud cases in 100,000 claims |
| Claims flagged for review | 1,275 | Limited by investigation capacity |
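The table's precision, recall, and F1 figures are mutually consistent, which lets us back-solve the approximate confusion matrix. A minimal sketch (the exact counts are assumptions derived from the reported metrics, not from the raw data):

```python
# Back-solving the confusion matrix implied by the reported metrics.
total_claims = 100_000
total_fraud = 1_800          # 1.8% fraud rate
flagged = 1_275              # claims sent to investigators

tp = round(flagged * 0.41)   # precision 0.41 -> ~523 correctly flagged frauds
fp = flagged - tp            # legitimate claims investigated anyway
fn = total_fraud - tp        # fraud cases the model misses
tn = total_claims - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy is a misleading yardstick here: a model that flags nothing
# would score 1 - 0.018 = 0.982 by always predicting "legitimate".
majority_baseline = 1 - total_fraud / total_claims

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"majority-class accuracy baseline={majority_baseline:.3f}")
```

The recomputed values land on 0.41 / 0.29 / 0.34, matching the table, and the 0.982 do-nothing baseline shows why raw accuracy rewards ignoring the fraud class entirely.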
The VP of Claims wants a recommendation for a single primary metric to drive model selection and threshold tuning. The recommendation must account for the severe class imbalance and the asymmetric business costs: a missed fraudulent claim costs about $12,000 on average, while investigating a legitimate claim costs about $85.
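Given the cost asymmetry, threshold tuning can be framed directly as expected-cost minimization rather than optimizing a generic metric. A minimal sketch, assuming the two cost figures from the case; the function names and the tiny scored-claims example are illustrative, not part of the actual model:

```python
COST_MISS = 12_000    # average loss on a missed fraudulent claim
COST_REVIEW = 85      # cost of investigating any claim

def expected_cost(y_true, scores, threshold):
    """Total cost of investigating every claim scoring >= threshold."""
    cost = 0
    for label, score in zip(y_true, scores):
        if score >= threshold:
            cost += COST_REVIEW    # pay the investigation cost, fraud or not
        elif label == 1:
            cost += COST_MISS      # missed fraud goes unrecovered
    return cost

def best_threshold(y_true, scores):
    """Pick the score cutoff that minimizes total expected cost."""
    return min(sorted(set(scores)),
               key=lambda t: expected_cost(y_true, scores, t))

# Break-even: investigating pays off whenever P(fraud) * COST_MISS > COST_REVIEW,
# i.e. for any claim with predicted fraud probability above roughly 0.007.
break_even = COST_REVIEW / COST_MISS

# Illustrative toy data: two frauds (1) and two legitimate claims (0).
y = [1, 0, 0, 1]
p = [0.9, 0.1, 0.2, 0.05]
print(best_threshold(y, p), f"break-even p={break_even:.4f}")
```

With a 141:1 cost ratio, the cost-optimal threshold sits far below 0.5, which is why recall (subject to the 1,275-claim review capacity) matters much more here than overall accuracy.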