Business Context
You’re on the Applied AI team at MercuryPay, a fintech with 18M monthly active users and ~220M customer-support chat messages/month across card disputes, account recovery, and chargeback workflows. MercuryPay is rolling out an LLM-based support agent to deflect tickets and reduce handle time. A recent red-team exercise found that the model sometimes (a) provides policy-violating financial advice, (b) reveals sensitive personal data when prompted, and (c) can be jailbroken into generating instructions for fraud.
Leadership wants a concrete plan for how Reinforcement Learning from Human Feedback (RLHF) improves safety beyond supervised fine-tuning, and how you would implement and evaluate it in a production setting.
Dataset
You have three data sources collected over 6 weeks:
| Data Source | Scale | Schema | Notes |
|---|---|---|---|
| Prompt–response pairs (SFT) | 1.8M | (prompt, response) | Mix of agent-written and model-written responses; ~12% contain policy disclaimers |
| Preference comparisons | 420K | (prompt, response_A, response_B, preferred, reason_code) | Labeled by trained vendors; 9 reason codes (e.g., “privacy”, “fraud enablement”, “medical/financial advice”) |
| Safety annotations | 160K | (prompt, response, safety_label, severity) | safety_label ∈ {safe, unsafe}; severity 1–5; unsafe ~7.5% |
Additional characteristics:
- Long-tail risk: Most “unsafe” examples are rare jailbreak patterns; distribution shifts weekly as attackers adapt.
- PII: ~3% of prompts contain PII (names, addresses, last-4 SSN); must not be memorized or echoed.
- Multi-objective: The product team cares about both helpfulness (resolution rate) and safety (policy compliance).
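To make the schemas in the table concrete, here is a minimal sketch of how the three sources might be typed in code. The field names mirror the table and the notes above; the exact types, field names beyond the schema columns (e.g. `has_policy_disclaimer`), and the `Literal` vocabularies are assumptions, not a fixed data contract.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical typed records mirroring the three data sources described above.

@dataclass
class SFTPair:
    prompt: str
    response: str                     # mix of agent-written and model-written
    has_policy_disclaimer: bool       # ~12% of pairs contain policy disclaimers

@dataclass
class PreferenceComparison:
    prompt: str
    response_a: str
    response_b: str
    preferred: Literal["A", "B"]
    reason_code: str                  # one of 9 codes, e.g. "privacy", "fraud enablement"

@dataclass
class SafetyAnnotation:
    prompt: str
    response: str
    safety_label: Literal["safe", "unsafe"]   # unsafe is ~7.5% of annotations
    severity: int                              # 1-5
```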
Success Criteria
- Reduce Sev-4/5 unsafe rate on an internal red-team suite from 1.2% → <0.2%.
- Maintain task success (human-rated helpfulness) within a 2% relative drop of the SFT baseline.
- For high-risk intents (account takeover, chargebacks), achieve refusal correctness (refuse when required, comply when allowed) above 97%.
- No regression in latency: p95 generation latency must remain < 900 ms at 40 tokens/sec on the existing GPU fleet.
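One way to read the first and third criteria is as simple rates over a labeled evaluation suite. The sketch below shows how the Sev-4/5 unsafe rate and refusal correctness could be computed from per-example judgments; the record fields (`safety_label`, `severity`, `model_refused`, `refusal_required`) are assumptions about how eval results might be stored, not a prescribed schema.

```python
from typing import Iterable

def sev45_unsafe_rate(examples: Iterable[dict]) -> float:
    """Fraction of red-team prompts whose response was judged unsafe at severity 4 or 5.
    Target: reduce from 1.2% to below 0.2%."""
    examples = list(examples)
    bad = sum(1 for e in examples
              if e["safety_label"] == "unsafe" and e["severity"] >= 4)
    return bad / max(len(examples), 1)

def refusal_correctness(examples: Iterable[dict]) -> float:
    """Fraction of high-risk-intent prompts where the model refused exactly when
    policy required it to (refuse when required, comply when allowed). Target: > 97%."""
    examples = list(examples)
    correct = sum(1 for e in examples
                  if e["model_refused"] == e["refusal_required"])
    return correct / max(len(examples), 1)
```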
Constraints
- Regulatory & auditability: You must produce an auditable safety report (why the model refuses, what policies it follows).
- Data budget: Only 10K new human preference labels/week are feasible.
- Compute: RL training must fit within 8×A100 for 24 hours per iteration.
- Deployment: The model ships behind a feature flag; online evaluation via shadow traffic and limited rollout.
Deliverables (What you must produce)
- Explain, concretely, how RLHF improves LLM safety compared to SFT-only (mechanism + failure modes).
- Propose an end-to-end RLHF pipeline: data collection, reward modeling, RL optimization, and safety guardrails.
- Define an evaluation plan with offline metrics (safety + helpfulness) and online metrics (user impact + incident monitoring).
- Describe how you would handle imbalance (unsafe is ~7.5%), label noise, and distribution shift.
- Provide a minimal implementation sketch (reward model + PPO/DPO-style optimization) and how you’d tune it.
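As a starting point for the last deliverable, the sketch below pairs a Bradley–Terry reward-model loss with a DPO-style objective, both trained on the (prompt, response_A, response_B, preferred) comparisons described in the Dataset section. It is a hedged illustration: the batch layout, how log-probs are obtained from the policy and frozen reference (SFT) model, and hyperparameters such as `beta` are assumptions to be tuned, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the preferred
    response above that of the rejected one. Inputs are [batch] reward scores
    from a reward head on top of the SFT backbone."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style objective: increase the policy's log-likelihood margin on the
    preferred response relative to a frozen reference (SFT) model, scaled by beta.
    All inputs are summed token log-probs of each full response, shape [batch]."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In practice you would likely upweight comparisons whose reason_code indicates a safety violation to counter the ~7.5% unsafe imbalance, and refresh the comparison pool weekly within the 10K-label budget so the reward signal tracks the shifting jailbreak distribution.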