Evaluate Metrics for Rare Player Behavior

Context

At ArenaPlay, a binary classification model predicts whether a player will exhibit a rare harmful behavior within the next 30 days. The event rate is only 0.1%: in a validation set of 1,000,000 players, only 1,000 are actual positives.

The team initially celebrated high accuracy, but operations reports show the model is missing many true cases while also generating too many alerts for the review team.

Current Performance

Metric	Current Model	Baseline: Predict All Negative
Accuracy	99.40%	99.90%
Precision	8.3%	0.0%
Recall	50.0%	0.0%
F1 Score	14.3%	0.0%
AUC-ROC	0.91	0.50
PR AUC	0.19	0.001
Flagged players	6,000	0
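
The figures in the table can be cross-checked by rebuilding the confusion matrix from the stated counts. The sketch below assumes only the numbers given above (1,000,000 players, 1,000 positives, 6,000 flagged, 50% recall); the function name is illustrative and not part of any ArenaPlay codebase.

```python
def confusion_from_summary(n_total, n_pos, n_flagged, recall):
    """Rebuild confusion-matrix counts and headline metrics from summary figures."""
    tp = round(recall * n_pos)      # true positives caught
    fn = n_pos - tp                 # true positives missed
    fp = n_flagged - tp             # flagged players who are actually negative
    tn = n_total - n_pos - fp       # negatives correctly left alone
    # Precision is undefined when nothing is flagged; report 0.0 by convention,
    # which is what the baseline column in the table does.
    precision = tp / n_flagged if n_flagged else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / n_total
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "f1": f1, "accuracy": accuracy,
    }


current = confusion_from_summary(1_000_000, 1_000, 6_000, 0.50)
baseline = confusion_from_summary(1_000_000, 1_000, 0, 0.0)

for name, m in (("current", current), ("baseline", baseline)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under these counts the current model makes 6,000 errors (5,500 false positives plus 500 false negatives) while the all-negative baseline makes only 1,000, which is why the baseline's accuracy is higher even though it catches nothing.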

The Problem

Leadership wants to know which metrics should be used to evaluate this model and whether the current model is good enough to deploy. The main concern is that accuracy looks excellent despite poor practical usefulness in a highly imbalanced setting.

Requirements

Explain which evaluation metrics are most appropriate for this class imbalance and why.
Interpret the current metrics and identify what they imply about model quality.
Discuss whether accuracy and AUC-ROC alone are sufficient here.
Recommend how you would choose an operating threshold based on business tradeoffs.
Propose improvements to evaluation and validation before deployment.

Constraints

The manual review team can investigate at most 2,000 flagged players per week.
Missing a true positive is estimated to be 20x more costly than reviewing a false positive.
Predicted probabilities may be used downstream for prioritization, so score calibration matters.
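
One way to turn these constraints into an operating point is to rank players by score, evaluate every "flag the top k" cutoff, and pick the k with the lowest total cost among those that fit review capacity. The sketch below is a minimal illustration under stated assumptions: costs are in relative units (false positive = 1, false negative = 20, from the 20x ratio above), the 2,000-review cap is assumed to apply to the batch being scored, ties in score are ignored, and `pick_threshold` is a hypothetical helper name. The toy labels and scores are made up purely to exercise the function.

```python
import numpy as np


def pick_threshold(y_true, scores, cost_fn=20.0, cost_fp=1.0, capacity=2000):
    """Choose the top-k cutoff that minimizes cost_fn*FN + cost_fp*FP with k <= capacity."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)

    # Rank players from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]

    # Consider flagging the top k players for k = 0 .. max_k.
    max_k = min(capacity, len(scores))
    k = np.arange(max_k + 1)
    tp = np.concatenate(([0], np.cumsum(y_sorted[:max_k])))  # positives in the top k
    fp = k - tp
    fn = y_true.sum() - tp
    cost = cost_fn * fn + cost_fp * fp

    best_k = int(np.argmin(cost))
    # Threshold is the score of the last flagged player (infinite if nobody is flagged).
    threshold = float(s_sorted[best_k - 1]) if best_k > 0 else float("inf")
    return {
        "flagged": best_k,
        "threshold": threshold,
        "tp": int(tp[best_k]),
        "fp": int(fp[best_k]),
        "fn": int(fn[best_k]),
        "cost": float(cost[best_k]),
    }


# Toy data (illustrative only): two positives among six players.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]

tight = pick_threshold(toy_labels, toy_scores, capacity=2)
loose = pick_threshold(toy_labels, toy_scores, capacity=6)
print("capacity 2:", tight)
print("capacity 6:", loose)
```

In the toy data the second positive sits at rank 3, so a capacity of 2 forces a miss while a capacity of 6 lets the cutoff extend far enough to catch it. If the scores are well calibrated, the unconstrained cost-minimizing rule is to flag any player whose predicted probability exceeds cost_fp / (cost_fp + cost_fn) = 1/21, roughly 0.048; the capacity cap only changes the answer when more players exceed that level than the team can review.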
