Business Context
NovaAssist is training a customer-support language model used by 8M monthly users. The team has collected human preference data and needs a reward model that scores candidate responses so the downstream RLHF pipeline can optimize toward helpful, safe, and policy-compliant outputs.
Dataset
You are given a preference-learning dataset built from prompt-response comparisons.
| Feature Group | Count | Examples |
| --- | --- | --- |
| Prompt text | 1 | user_query, conversation_history |
| Candidate A text | 1 | model_a_response |
| Candidate B text | 1 | model_b_response |
| Annotation metadata | 6 | annotator_id, policy_version, locale, task_type, response_length_a, response_length_b |
| Safety / quality signals | 8 | toxicity_score, refusal_flag, factuality_hint, formatting_score |
- Size: 420K pairwise comparisons from 95K unique prompts
- Target: Binary label where 1 means response A was preferred over response B (0 means B was preferred)
- Class balance: Roughly balanced overall (52% A preferred), but skewed within individual task-type and locale slices
- Missing data: 12% missing in metadata-derived quality signals; some prompts have multiple conflicting annotations (see the aggregation sketch below)
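Conflicting annotations can be collapsed before training, for example by majority vote per (prompt, pair), keeping annotator agreement as a soft signal. A minimal pandas sketch; the column names `prompt_id`, `pair_id`, and `label` are assumptions standing in for the real schema:

```python
import pandas as pd

# Toy stand-in for the comparisons table; one row per annotation.
# label: 1 = A preferred, 0 = B preferred (hypothetical column names).
df = pd.DataFrame({
    "prompt_id": [1, 1, 1, 2, 2],
    "pair_id":   [10, 10, 10, 20, 20],
    "label":     [1, 1, 0, 0, 0],
})

# Majority-vote aggregation collapses conflicting annotations of the
# same pair into one training row, keeping agreement as a soft signal.
agg = (
    df.groupby(["prompt_id", "pair_id"])["label"]
      .agg(mean_label="mean", n_annotations="count")
      .reset_index()
)
agg["label"] = (agg["mean_label"] > 0.5).astype(int)

# Low-agreement pairs (mean near 0.5, including exact ties) can be
# dropped or down-weighted rather than trained on as hard labels.
agg["ambiguous"] = agg["mean_label"].between(1 / 3, 2 / 3)
print(agg)
```

An alternative is to keep `mean_label` as a soft target, which preserves annotator disagreement instead of discarding it.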
Success Criteria
A strong solution should:
- Achieve pairwise accuracy >= 72% on a held-out test set
- Reach ROC-AUC >= 0.80 and log loss <= 0.56
- Produce calibrated reward scores that can be used safely in a PPO or DPO-style RLHF pipeline (a metric and calibration sketch follows this list)
- Support offline batch scoring of 5M response pairs per day
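The sketch below shows how the three target metrics would be computed from Bradley-Terry win probabilities, plus a simple Platt-style recalibration step; the synthetic scores stand in for real held-out predictions and are not part of the dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: margin = r(A) - r(B) from the reward model,
# y = 1 if A was preferred. Replace with real held-out predictions.
margin = rng.normal(size=10_000)
y = (rng.random(10_000) < 1 / (1 + np.exp(-1.5 * margin))).astype(int)

p = 1 / (1 + np.exp(-margin))  # Bradley-Terry win probability
print("pairwise accuracy:", ((p >= 0.5) == y).mean())
print("ROC-AUC:", roc_auc_score(y, p))
print("log loss:", log_loss(y, p))

# Platt-style recalibration: fit a 1-D logistic regression on a
# validation slice so downstream RLHF consumes calibrated probabilities.
calibrator = LogisticRegression()
calibrator.fit(margin[:5_000].reshape(-1, 1), y[:5_000])
p_cal = calibrator.predict_proba(margin[5_000:].reshape(-1, 1))[:, 1]
print("calibrated log loss:", log_loss(y[5_000:], p_cal))
```

Temperature scaling is a common alternative to Platt scaling here; either way, calibration should be fit on a split that shares no prompts with the test set.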
Constraints
- Inference latency must stay under 25 ms per pair on GPU batch serving
- The model must be auditable for safety reviews and slice analysis by locale/task type
- Training budget is limited to fine-tuning a small transformer or training a lightweight pairwise model on frozen embeddings (a sketch of the latter follows this list)
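For the frozen-embeddings option, one common shape is a shared scoring head r(·) trained with a Bradley-Terry objective, P(A preferred) = sigmoid(r(A) − r(B)). A minimal PyTorch sketch; the embedding dimension, head width, and batch contents are assumptions, not a definitive architecture:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Scores one (prompt, response) embedding; shared between A and B."""
    def __init__(self, dim: int = 768):  # dim is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)  # scalar reward per row

head = RewardHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

def bt_loss(emb_a, emb_b, label):
    """Bradley-Terry: P(A preferred) = sigmoid(r(A) - r(B))."""
    margin = head(emb_a) - head(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(
        margin, label.float()
    )

# Toy batch of frozen (prompt + response) embeddings; replace with
# real encoder outputs in practice.
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
label = torch.randint(0, 2, (32,))  # 1 = A preferred
loss = bt_loss(emb_a, emb_b, label)
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```

Because the head is tiny and the encoder is frozen, this variant fits comfortably inside the latency and throughput budgets above: embeddings can be precomputed once and the head scored in large GPU batches.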
Deliverables
- Build a reward model that predicts which response humans prefer.
- Define preprocessing for text, metadata, and conflicting annotations.
- Choose an evaluation strategy that avoids prompt-level leakage (a grouped-split sketch follows this list).
- Show how reward scores would be thresholded or calibrated for downstream RLHF training.
- Explain deployment, monitoring, and retraining decisions.
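Since the 420K comparisons come from only 95K unique prompts, a random row-level split would place comparisons from the same prompt in both train and test. To avoid that leakage, every comparison derived from a prompt must fall on the same side of every split. A sketch using scikit-learn's GroupShuffleSplit, grouping on a hypothetical prompt_ids array:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical arrays: one row per comparison; prompt_ids maps each
# comparison back to its source prompt (95K unique values here).
n = 420_000
rng = np.random.default_rng(0)
prompt_ids = rng.integers(0, 95_000, size=n)
X, y = np.zeros((n, 1)), rng.integers(0, 2, size=n)

# All comparisons from a prompt land on the same side of the split,
# so no prompt appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=prompt_ids))
assert set(prompt_ids[train_idx]).isdisjoint(prompt_ids[test_idx])
```

The same grouping should be applied to the calibration split, and GroupKFold can replace GroupShuffleSplit if cross-validated estimates are needed for slice analysis by locale and task type.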