CivitasAI is preparing to deploy a text classification model that flags harmful user prompts before they reach a generative assistant. The model is a fine-tuned transformer, trained as a binary classifier to detect harassment, self-harm encouragement, violent threats, and requests for instructions to commit illegal acts.
An offline evaluation on a 50,000-example holdout set shows strong aggregate accuracy, but the Trust & Safety team is concerned that the rate of missed harmful prompts remains too high for a production launch.
| Metric | Holdout Set (n = 50,000) | Launch Target |
|---|---|---|
| Accuracy | 0.947 | >= 0.940 |
| Precision | 0.781 | >= 0.800 |
| Recall | 0.684 | >= 0.850 |
| F1 Score | 0.729 | >= 0.820 |
| AUC-ROC | 0.912 | >= 0.900 |
| False Positive Rate | 0.041 | <= 0.030 |
| Harmful prevalence | 12.0% | n/a |
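For reference, the aggregate metrics above can be recomputed directly from the holdout predictions. The sketch below is illustrative rather than CivitasAI's actual evaluation code: the array names and file paths (`y_true`, `y_score`, `holdout_labels.npy`) are assumptions, and it applies scikit-learn at the default 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# Hypothetical holdout arrays: y_true holds binary labels (1 = harmful),
# y_score holds the classifier's predicted probability of "harmful".
y_true = np.load("holdout_labels.npy")   # shape (50_000,)
y_score = np.load("holdout_scores.npy")  # shape (50_000,)

# Binarize at the default 0.5 decision threshold.
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"Accuracy  : {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision : {precision_score(y_true, y_pred):.3f}")
print(f"Recall    : {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score  : {f1_score(y_true, y_pred):.3f}")
print(f"AUC-ROC   : {roc_auc_score(y_true, y_score):.3f}")
print(f"FPR       : {fp / (fp + tn):.3f}")
print(f"Prevalence: {y_true.mean():.3f}")
```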
The model clears its accuracy and AUC-ROC targets, but recall (0.684) falls well short of the 0.850 launch threshold, and precision, F1, and false positive rate also miss their marks. Missing a harmful prompt is considered more costly than over-blocking a benign one, especially in high-risk categories such as self-harm and explicit violence.
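Because AUC-ROC is reported, the classifier must emit a score rather than only a hard label, so the most direct lever for raising recall is the decision threshold: lowering it trades more benign over-blocking for fewer harmful misses. The sweep below is a sketch of that trade-off, not the team's chosen remediation; it reuses the assumed `y_true` / `y_score` arrays from above and finds the highest threshold that still meets the 0.850 recall target.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, confusion_matrix

y_true = np.load("holdout_labels.npy")   # assumed label array, as above
y_score = np.load("holdout_scores.npy")  # assumed score array, as above

# precision_recall_curve returns one threshold per (precision, recall)
# pair; recall falls monotonically as the threshold rises.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# recall[:-1] aligns index-for-index with thresholds; select the highest
# threshold that still achieves the launch recall target.
target_recall = 0.850
meets = recall[:-1] >= target_recall
if not meets.any():
    raise ValueError("No threshold reaches the recall target on this holdout.")
best = thresholds[meets].max()

# Re-score the holdout at the adjusted threshold and report the cost.
y_pred = (y_score >= best).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Threshold : {best:.3f}")
print(f"Recall    : {tp / (tp + fn):.3f}")
print(f"Precision : {tp / (tp + fp):.3f}")
print(f"FPR       : {fp / (fp + tn):.3f}")
```

Note that moving the threshold only slides the operating point along the existing ROC curve: any recall gain comes at the cost of a higher false positive rate, which already exceeds its 0.030 ceiling. If no single threshold satisfies both constraints, the model itself (training data, class balance, per-category coverage) needs improvement rather than recalibration alone.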