Business Context
You’re on the Applied AI team at MercuryPay, a fintech with 18M monthly active users and ~220M customer-support chat messages/month across card disputes, account recovery, and chargeback workflows. MercuryPay is rolling out an LLM-based support agent to deflect tickets and reduce handle time. A recent red-team exercise found that the model sometimes (a) provides policy-violating financial advice, (b) reveals sensitive personal data when prompted, and (c) can be jailbroken into generating instructions for fraud.
Leadership wants a concrete plan for how Reinforcement Learning from Human Feedback (RLHF) improves safety beyond supervised fine-tuning, and how you would implement and evaluate it in a production setting.
Dataset
You have three data sources collected over 6 weeks:
| Data Source | Scale | Schema | Notes |
|---|---|---|---|
| Prompt–response pairs (SFT) | 1.8M | (prompt, response) | Mix of agent-written and model-written responses; ~12% contain policy disclaimers |
| Preference comparisons | 420K | (prompt, response_A, response_B, preferred, reason_code) | Labeled by trained vendors; 9 reason codes (e.g., “privacy”, “fraud enablement”, “medical/financial advice”) |
| Safety annotations | 160K | (prompt, response, safety_label, severity) | safety_label ∈ {safe, unsafe}; severity 1–5; unsafe ~7.5% |
Additional characteristics:
- Long-tail risk: Most “unsafe” examples are rare jailbreak patterns; distribution shifts weekly as attackers adapt.
- PII: ~3% of prompts contain PII (names, addresses, last-4 SSN); must not be memorized or echoed.
- Multi-objective: The product team cares about both helpfulness (resolution rate) and safety (policy compliance).
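To make the schemas in the table concrete, here is a minimal sketch of how the three sources might be typed in code. The field names mirror the table and the notes above; the exact types, field names beyond the schema columns (e.g. `has_policy_disclaimer`), and the `Literal` vocabularies are assumptions, not a fixed data contract.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical typed records mirroring the three data sources described above.

@dataclass
class SFTPair:
    prompt: str
    response: str                     # mix of agent-written and model-written
    has_policy_disclaimer: bool       # ~12% of pairs contain policy disclaimers

@dataclass
class PreferenceComparison:
    prompt: str
    response_a: str
    response_b: str
    preferred: Literal["A", "B"]
    reason_code: str                  # one of 9 codes, e.g. "privacy", "fraud enablement"

@dataclass
class SafetyAnnotation:
    prompt: str
    response: str
    safety_label: Literal["safe", "unsafe"]   # unsafe is ~7.5% of annotations
    severity: int                              # 1-5
```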
Success Criteria
- Reduce Sev-4/5 unsafe rate on an internal red-team suite from 1.2% → <0.2%.
- Maintain task success (human-rated helpfulness) within a 2% relative drop of the SFT baseline.
- For high-risk intents (account takeover, chargebacks), achieve refusal correctness (refuse when required, comply when allowed) above 97%.
- No regression in latency: p95 generation latency must remain < 900 ms at 40 tokens/sec on the existing GPU fleet.
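One way to read the first and third criteria is as simple rates over a labeled evaluation suite. The sketch below shows how the Sev-4/5 unsafe rate and refusal correctness could be computed from per-example judgments; the record fields (`safety_label`, `severity`, `model_refused`, `refusal_required`) are assumptions about how eval results might be stored, not a prescribed schema.

```python
from typing import Iterable

def sev45_unsafe_rate(examples: Iterable[dict]) -> float:
    """Fraction of red-team prompts whose response was judged unsafe at severity 4 or 5.
    Target: reduce from 1.2% to below 0.2%."""
    examples = list(examples)
    bad = sum(1 for e in examples
              if e["safety_label"] == "unsafe" and e["severity"] >= 4)
    return bad / max(len(examples), 1)

def refusal_correctness(examples: Iterable[dict]) -> float:
    """Fraction of high-risk-intent prompts where the model refused exactly when
    policy required it to (refuse when required, comply when allowed). Target: > 97%."""
    examples = list(examples)
    correct = sum(1 for e in examples
                  if e["model_refused"] == e["refusal_required"])
    return correct / max(len(examples), 1)
```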
Constraints
- Regulatory & auditability: You must produce an auditable safety report (why the model refuses, what policies it follows).
- Data budget: Only 10K new human preference labels/week are feasible.
- Compute: RL training must fit within 8×A100 for 24 hours per iteration.
- Deployment: The model ships behind a feature flag; online evaluation via shadow traffic and limited rollout.
Deliverables (What you must produce)
- Explain, concretely, how RLHF improves LLM safety compared to SFT-only (mechanism + failure modes).
- Propose an end-to-end RLHF pipeline: data collection, reward modeling, RL optimization, and safety guardrails.
- Define an evaluation plan with offline metrics (safety + helpfulness) and online metrics (user impact + incident monitoring).
- Describe how you would handle imbalance (unsafe is ~7.5%), label noise, and distribution shift.
- Provide a minimal implementation sketch (reward model + PPO/DPO-style optimization) and how you’d tune it.
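As a starting point for the last deliverable, the sketch below pairs a Bradley–Terry reward-model loss with a DPO-style objective, both trained on the (prompt, response_A, response_B, preferred) comparisons described in the Dataset section. It is a hedged illustration: the batch layout, how log-probs are obtained from the policy and frozen reference (SFT) model, and hyperparameters such as `beta` are assumptions to be tuned, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the preferred
    response above that of the rejected one. Inputs are [batch] reward scores
    from a reward head on top of the SFT backbone."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style objective: increase the policy's log-likelihood margin on the
    preferred response relative to a frozen reference (SFT) model, scaled by beta.
    All inputs are summed token log-probs of each full response, shape [batch]."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In practice you would likely upweight comparisons whose reason_code indicates a safety violation to counter the ~7.5% unsafe imbalance, and refresh the comparison pool weekly within the 10K-label budget so the reward signal tracks the shifting jailbreak distribution.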