Business Context
NovaAssist is training a customer-support language model used by 8M monthly users. The team has collected human preference data and needs a reward model that scores candidate responses so the downstream RLHF pipeline can optimize toward helpful, safe, and policy-compliant outputs.
Dataset
You are given a preference-learning dataset built from prompt-response comparisons.
| Feature Group | Count | Examples |
| --- | --- | --- |
| Prompt text | 1 | user_query, conversation_history |
| Candidate A text | 1 | model_a_response |
| Candidate B text | 1 | model_b_response |
| Annotation metadata | 6 | annotator_id, policy_version, locale, task_type, response_length_a, response_length_b |
| Safety / quality signals | 8 | toxicity_score, refusal_flag, factuality_hint, formatting_score |
- Size: 420K pairwise comparisons from 95K unique prompts
- Target: Binary label where 1 means response A was preferred over response B (0 means B was preferred)
- Class balance: Roughly balanced overall (52% A preferred), but skewed within individual task-type and locale slices
- Missing data: 12% missing in metadata-derived quality signals; some prompts have multiple conflicting annotations (see the aggregation sketch below)
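Conflicting annotations can be collapsed before training, for example by majority vote per (prompt, pair), keeping annotator agreement as a soft signal. A minimal pandas sketch; the column names `prompt_id`, `pair_id`, and `label` are assumptions standing in for the real schema:

```python
import pandas as pd

# Toy stand-in for the comparisons table; one row per annotation.
# label: 1 = A preferred, 0 = B preferred (hypothetical column names).
df = pd.DataFrame({
    "prompt_id": [1, 1, 1, 2, 2],
    "pair_id":   [10, 10, 10, 20, 20],
    "label":     [1, 1, 0, 0, 0],
})

# Majority-vote aggregation collapses conflicting annotations of the
# same pair into one training row, keeping agreement as a soft signal.
agg = (
    df.groupby(["prompt_id", "pair_id"])["label"]
      .agg(mean_label="mean", n_annotations="count")
      .reset_index()
)
agg["label"] = (agg["mean_label"] > 0.5).astype(int)

# Low-agreement pairs (mean near 0.5, including exact ties) can be
# dropped or down-weighted rather than trained on as hard labels.
agg["ambiguous"] = agg["mean_label"].between(1 / 3, 2 / 3)
print(agg)
```

An alternative is to keep `mean_label` as a soft target, which preserves annotator disagreement instead of discarding it.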
Success Criteria
A strong solution should:
- Achieve pairwise accuracy >= 72% on a held-out test set
- Reach ROC-AUC >= 0.80 and log loss <= 0.56
- Produce calibrated reward scores that can be used safely in a PPO or DPO-style RLHF pipeline (a metric and calibration sketch follows this list)
- Support offline batch scoring of 5M response pairs per day
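The sketch below shows how the three target metrics would be computed from Bradley-Terry win probabilities, plus a simple Platt-style recalibration step; the synthetic scores stand in for real held-out predictions and are not part of the dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: margin = r(A) - r(B) from the reward model,
# y = 1 if A was preferred. Replace with real held-out predictions.
margin = rng.normal(size=10_000)
y = (rng.random(10_000) < 1 / (1 + np.exp(-1.5 * margin))).astype(int)

p = 1 / (1 + np.exp(-margin))  # Bradley-Terry win probability
print("pairwise accuracy:", ((p >= 0.5) == y).mean())
print("ROC-AUC:", roc_auc_score(y, p))
print("log loss:", log_loss(y, p))

# Platt-style recalibration: fit a 1-D logistic regression on a
# validation slice so downstream RLHF consumes calibrated probabilities.
calibrator = LogisticRegression()
calibrator.fit(margin[:5_000].reshape(-1, 1), y[:5_000])
p_cal = calibrator.predict_proba(margin[5_000:].reshape(-1, 1))[:, 1]
print("calibrated log loss:", log_loss(y[5_000:], p_cal))
```

Temperature scaling is a common alternative to Platt scaling here; either way, calibration should be fit on a split that shares no prompts with the test set.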
Constraints
- Inference latency must stay under 25 ms per pair on GPU batch serving
- The model must be auditable for safety reviews and slice analysis by locale/task type
- Training budget is limited to fine-tuning a small transformer or training a lightweight pairwise model on frozen embeddings (a sketch of the latter follows this list)
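For the frozen-embeddings option, one common shape is a shared scoring head r(·) trained with a Bradley-Terry objective, P(A preferred) = sigmoid(r(A) − r(B)). A minimal PyTorch sketch; the embedding dimension, head width, and batch contents are assumptions, not a definitive architecture:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Scores one (prompt, response) embedding; shared between A and B."""
    def __init__(self, dim: int = 768):  # dim is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)  # scalar reward per row

head = RewardHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

def bt_loss(emb_a, emb_b, label):
    """Bradley-Terry: P(A preferred) = sigmoid(r(A) - r(B))."""
    margin = head(emb_a) - head(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(
        margin, label.float()
    )

# Toy batch of frozen (prompt + response) embeddings; replace with
# real encoder outputs in practice.
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
label = torch.randint(0, 2, (32,))  # 1 = A preferred
loss = bt_loss(emb_a, emb_b, label)
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```

Because the head is tiny and the encoder is frozen, this variant fits comfortably inside the latency and throughput budgets above: embeddings can be precomputed once and the head scored in large GPU batches.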
Deliverables
- Build a reward model that predicts which response humans prefer.
- Define preprocessing for text, metadata, and conflicting annotations.
- Choose an evaluation strategy that avoids prompt-level leakage (a grouped-split sketch follows this list).
- Show how reward scores would be thresholded or calibrated for downstream RLHF training.
- Explain deployment, monitoring, and retraining decisions.
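Since the 420K comparisons come from only 95K unique prompts, a random row-level split would place comparisons from the same prompt in both train and test. To avoid that leakage, every comparison derived from a prompt must fall on the same side of every split. A sketch using scikit-learn's GroupShuffleSplit, grouping on a hypothetical prompt_ids array:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical arrays: one row per comparison; prompt_ids maps each
# comparison back to its source prompt (95K unique values here).
n = 420_000
rng = np.random.default_rng(0)
prompt_ids = rng.integers(0, 95_000, size=n)
X, y = np.zeros((n, 1)), rng.integers(0, 2, size=n)

# All comparisons from a prompt land on the same side of the split,
# so no prompt appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=prompt_ids))
assert set(prompt_ids[train_idx]).isdisjoint(prompt_ids[test_idx])
```

The same grouping should be applied to the calibration split, and GroupKFold can replace GroupShuffleSplit if cross-validated estimates are needed for slice analysis by locale and task type.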