OpenAI is training models to determine whether prompts or responses violate a moderation policy. Some downstream systems need a binary classification probability (safe vs. unsafe), while others need a continuous severity score for ranking and triage. You need to compare cross-entropy loss and mean squared error (MSE), decide when each loss should be used, and demonstrate the impact of that choice on model behavior.
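To make the contrast concrete, here is a minimal numpy sketch (the helper names are illustrative, not part of any dataset or API) that evaluates both losses on a single positive example under increasingly wrong predictions:

```python
import numpy as np

# Illustrative helpers: evaluate both losses on one example with
# true label y and predicted probability p.
def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

# A positive example (y = 1) as the prediction gets worse:
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"p={p:.2f}  cross-entropy={binary_cross_entropy(1, p):.3f}  "
          f"MSE={squared_error(1, p):.3f}")
```

Cross-entropy grows without bound as the predicted probability of the true class approaches zero, so confident mistakes dominate the gradient; squared error saturates at 1.0, giving a much weaker training signal exactly where the classifier is most wrong. That asymmetry is the core reason cross-entropy is the default for the binary target, while MSE matches the continuous severity target.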
You are given a labeled moderation dataset built from OpenAI safety review workflows.
| Feature Group | Count | Examples |
|---|---|---|
| Embedding features | 1536 | text_embedding_0 ... text_embedding_1535 from text-embedding-3-large |
| Metadata | 6 | language, source_surface, prompt_length, response_length, user_reported |
| Policy labels | 2 | unsafe_binary, severity_score |
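Assuming the dataset ships as a single flat file (the name moderation.csv is hypothetical), a minimal pandas sketch to separate these groups:

```python
import pandas as pd

df = pd.read_csv("moderation.csv")  # hypothetical file name

embedding_cols = [f"text_embedding_{i}" for i in range(1536)]
metadata_cols = ["language", "source_surface", "prompt_length",
                 "response_length", "user_reported"]

X = df[embedding_cols + metadata_cols]   # model inputs
y_class = df["unsafe_binary"]            # binary target for cross-entropy
y_severity = df["severity_score"]        # continuous target for MSE
```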
- unsafe_binary: 1 if the content violates policy, else 0
- severity_score: continuous score in [0, 1] from human review aggregation
- unsafe_binary is imbalanced: 11% positive, 89% negative (handled with class weighting in the sketch after the list below)
- language and user_reported contain missing values; the embeddings are complete

A strong solution should:
- Train a binary classifier for unsafe_binary with cross-entropy loss, accounting for the class imbalance
- Train a regressor for severity_score with MSE
- Explain when each loss is the right choice and demonstrate the impact on model behavior (a sketch of both modeling paths follows this list)
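As a starting point, here is a minimal scikit-learn sketch of both paths. It assumes the hypothetical moderation.csv file above and uses only the embedding features for brevity (the metadata columns would first need imputation and encoding). LogisticRegression fits by minimizing cross-entropy (log loss), Ridge minimizes MSE, and class_weight="balanced" reweights the log loss to counter the 11/89 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import log_loss, mean_squared_error, average_precision_score

df = pd.read_csv("moderation.csv")  # hypothetical file name
X = df[[f"text_embedding_{i}" for i in range(1536)]]

# --- Classification: unsafe_binary with cross-entropy (log loss) ---
y = df["unsafe_binary"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# class_weight="balanced" upweights the 11% positive class in the loss.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]
print("log loss:", log_loss(y_te, p))
print("average precision:", average_precision_score(y_te, p))

# --- Regression: severity_score with MSE ---
s = df["severity_score"]
Xs_tr, Xs_te, s_tr, s_te = train_test_split(
    X, s, test_size=0.2, random_state=0)
reg = Ridge(alpha=1.0)
reg.fit(Xs_tr, s_tr)
print("MSE:", mean_squared_error(s_te, reg.predict(Xs_te)))
```

Note that the stratified split preserves the 11% positive rate in both folds, and that the regressor's predictions are not constrained to [0, 1]; clipping or a sigmoid link would keep severity scores in range.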