OpenAI is training models to determine whether prompts or responses violate a moderation policy. Some downstream systems need a binary classification probability (safe vs. unsafe), while others need a continuous severity score for ranking and triage. You need to compare cross-entropy loss and mean squared error (MSE), decide when each loss should be used, and demonstrate the impact of that choice on model behavior.
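To make the contrast concrete, here is a minimal numpy sketch (the helper names are illustrative, not part of any dataset or API) that evaluates both losses on a single positive example under increasingly wrong predictions:

```python
import numpy as np

# Illustrative helpers: evaluate both losses on one example with
# true label y and predicted probability p.
def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

# A positive example (y = 1) as the prediction gets worse:
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"p={p:.2f}  cross-entropy={binary_cross_entropy(1, p):.3f}  "
          f"MSE={squared_error(1, p):.3f}")
```

Cross-entropy grows without bound as the predicted probability of the true class approaches zero, so confident mistakes dominate the gradient; squared error saturates at 1.0, giving a much weaker training signal exactly where the classifier is most wrong. That asymmetry is the core reason cross-entropy is the default for the binary target, while MSE matches the continuous severity target.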
You are given a labeled moderation dataset built from OpenAI safety review workflows.
| Feature Group | Count | Examples |
|---|---|---|
| Embedding features | 1536 | text_embedding_0 ... text_embedding_1535 from text-embedding-3-large |
| Metadata | 6 | language, source_surface, prompt_length, response_length, user_reported |
| Policy labels | 2 | unsafe_binary, severity_score |
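Assuming the dataset ships as a single flat file (the name moderation.csv is hypothetical), a minimal pandas sketch to separate these groups:

```python
import pandas as pd

df = pd.read_csv("moderation.csv")  # hypothetical file name

embedding_cols = [f"text_embedding_{i}" for i in range(1536)]
metadata_cols = ["language", "source_surface", "prompt_length",
                 "response_length", "user_reported"]

X = df[embedding_cols + metadata_cols]   # model inputs
y_class = df["unsafe_binary"]            # binary target for cross-entropy
y_severity = df["severity_score"]        # continuous target for MSE
```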
- unsafe_binary: 1 if the content violates policy, else 0
- severity_score: continuous score in [0, 1] from human review aggregation
- unsafe_binary is imbalanced: 11% positive, 89% negative (handled with class weighting in the sketch after the list below)
- language and user_reported contain missing values; the embeddings are complete

A strong solution should:
- Train a binary classifier for unsafe_binary with cross-entropy loss, accounting for the class imbalance
- Train a regressor for severity_score with MSE
- Explain when each loss is the right choice and demonstrate the impact on model behavior (a sketch of both modeling paths follows this list)
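As a starting point, here is a minimal scikit-learn sketch of both paths. It assumes the hypothetical moderation.csv file above and uses only the embedding features for brevity (the metadata columns would first need imputation and encoding). LogisticRegression fits by minimizing cross-entropy (log loss), Ridge minimizes MSE, and class_weight="balanced" reweights the log loss to counter the 11/89 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import log_loss, mean_squared_error, average_precision_score

df = pd.read_csv("moderation.csv")  # hypothetical file name
X = df[[f"text_embedding_{i}" for i in range(1536)]]

# --- Classification: unsafe_binary with cross-entropy (log loss) ---
y = df["unsafe_binary"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# class_weight="balanced" upweights the 11% positive class in the loss.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]
print("log loss:", log_loss(y_te, p))
print("average precision:", average_precision_score(y_te, p))

# --- Regression: severity_score with MSE ---
s = df["severity_score"]
Xs_tr, Xs_te, s_tr, s_te = train_test_split(
    X, s, test_size=0.2, random_state=0)
reg = Ridge(alpha=1.0)
reg.fit(Xs_tr, s_tr)
print("MSE:", mean_squared_error(s_te, reg.predict(Xs_te)))
```

Note that the stratified split preserves the 11% positive rate in both folds, and that the regressor's predictions are not constrained to [0, 1]; clipping or a sigmoid link would keep severity scores in range.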