ShopSafe is building a binary classifier to detect fraudulent orders before fulfillment. Fraud is rare, so leadership is concerned that the current evaluation dashboard may overstate model quality by focusing on accuracy.
The team evaluated a logistic regression model on 100,000 recent orders. Only 1,000 orders were actually fraudulent.
| Metric | Value |
|---|---|
| Accuracy | 0.991 |
| Precision | 0.750 |
| Recall | 0.180 |
| F1 Score | 0.290 |
| AUC-ROC | 0.840 |
| Fraud prevalence | 0.010 |
Confusion matrix counts:
| | Predicted Fraud | Predicted Legitimate |
|---|---|---|
| Actual Fraud | 180 | 820 |
| Actual Legitimate | 60 | 98,940 |
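The dashboard numbers can be verified directly from the confusion matrix. A minimal sketch in plain Python, using only the counts above:

```python
# Recompute the summary metrics from the confusion matrix counts,
# so the dashboard values can be sanity-checked.
tp, fn = 180, 820        # actual fraud: caught vs. missed
fp, tn = 60, 98_940      # actual legitimate: falsely flagged vs. cleared

total = tp + fn + fp + tn                # 100,000 orders
accuracy = (tp + tn) / total             # correct predictions / all orders
precision = tp / (tp + fp)               # of flagged orders, how many were fraud
recall = tp / (tp + fn)                  # of fraudulent orders, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Running this reproduces precision 0.750, recall 0.180, and F1 0.290 from the table; accuracy comes out to 0.9912, dominated by the 98,940 true negatives.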
The product manager sees 99.1% accuracy and believes the model is ready for launch. The risk team counters that the model is still weak: with recall of 0.18, it misses 82% of fraudulent orders. You need to explain what the F1 score means, why it matters at 1% fraud prevalence, and whether it is a better summary metric than accuracy.
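One way to make the risk team's point concrete is to score a degenerate baseline. The sketch below (an illustrative comparison, not part of the team's evaluation) evaluates a classifier that never flags anything: at 1% prevalence it nearly matches the real model on accuracy while catching zero fraud, which is exactly the failure mode F1 exposes.

```python
# A trivial classifier that predicts "legitimate" for every order.
# On the same 100,000 orders: no true or false positives, all fraud missed.
tp, fn, fp, tn = 0, 1_000, 0, 99_000

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 0.99 -- looks strong
recall = tp / (tp + fn)                      # 0.0  -- catches no fraud
# Precision is undefined (no positive predictions); by the usual
# convention F1 is reported as 0 when there are no true positives.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"baseline: accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy barely distinguishes the real model from this do-nothing baseline, while F1 (0.29 vs. 0.0) does, because F1 ignores true negatives entirely and so cannot be inflated by the 99% majority class.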