Context
You’re the ML lead for HelpHub, a B2B customer support platform used by ~1,200 SaaS companies. HelpHub is rolling out an LLM-powered support agent that drafts responses for human agents and, for low-risk tickets, can auto-send replies. The agent uses RAG over each customer’s private knowledge base (KB): product docs, policy pages, and past resolved tickets. HelpHub serves ~2.5M end-user tickets/month and ~90k agent-assist interactions/day.
The business stakes are high:
- Incorrect policy guidance (refunds, cancellations, data retention) can create contractual liability and churn.
- Slow responses hurt agent productivity; leadership has set a p95 latency SLO.
- LLM spend is now one of the fastest-growing COGS lines; Finance requires per-interaction cost reporting and guardrails.
You have two candidate configurations (sketched as harness configs below):
- Model A (Cheaper/Faster): smaller model + 3 retrieved chunks
- Model B (Safer/Slower): larger model + 6 retrieved chunks + stricter system prompt
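Sketched as hypothetical eval-harness configs (the model identifiers and parameter keys are illustrative assumptions, not an existing HelpHub schema):

```python
# Two candidate configurations for the offline comparison. Values mirror the
# descriptions above; model names are placeholders.
CANDIDATES = {
    "model_a": {"model": "small-fast-llm", "retrieved_chunks": 3, "system_prompt": "baseline"},
    "model_b": {"model": "large-safe-llm", "retrieved_chunks": 6, "system_prompt": "strict"},
}
```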
You can use LangSmith (traces, datasets, evaluators) or build a custom harness. You have a labeled evaluation set of 2,000 historical tickets across 8 verticals (fintech, e-commerce, healthcare SaaS, HR, etc.). For each ticket, you have: (1) the user message, (2) the correct KB articles that should be cited, (3) a reference “gold” answer written by a senior support agent, and (4) a risk label: Low / Medium / High.
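In a custom harness, each labeled ticket could be represented with a record like the one below; field names are an illustrative assumption, not an existing HelpHub format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Risk(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class EvalTicket:
    """One labeled ticket from the 2,000-ticket offline eval set."""
    ticket_id: str
    vertical: str                    # e.g. "fintech", "healthcare_saas"
    user_message: str                # (1) the end user's message
    gold_kb_article_ids: List[str]   # (2) KB articles that should be cited
    gold_answer: str                 # (3) reference answer from a senior agent
    risk: Risk                       # (4) Low / Medium / High
```

The same fields map naturally onto a LangSmith dataset example (inputs: user message; reference outputs: gold answer and citations; metadata: vertical and risk) if you take that route instead of a custom harness.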
Current Offline Results (2,000-ticket eval)
| Metric | Model A | Model B | Target / Constraint |
|---|---|---|---|
| Hallucination rate (overall) | 3.8% | 1.9% | ≤ 2.0% overall |
| Hallucination rate (High risk) | 7.4% | 3.1% | ≤ 1.0% high-risk (auto-send) |
| Citation coverage (has ≥1 valid KB citation) | 86% | 93% | ≥ 95% |
| Answer acceptance by agents (offline proxy) | 62% | 69% | ≥ 70% |
| p50 latency (end-to-end) | 1.6s | 2.4s | p50 ≤ 2.0s |
| p95 latency (end-to-end) | 4.9s | 7.8s | p95 ≤ 6.0s |
| Avg prompt tokens | 1,150 | 1,980 | — |
| Avg completion tokens | 260 | 320 | — |
| Estimated cost / interaction | $0.006 | $0.021 | ≤ $0.015 |
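Before diagnosing anything else, a back-of-the-envelope check of the cost row against the budget cap in the Constraints below (assuming every one of the ~90k daily agent-assist interactions makes one model call and a 30-day month, both simplifications):

```python
# Rough monthly LLM spend at current agent-assist volume.
INTERACTIONS_PER_DAY = 90_000
DAYS_PER_MONTH = 30

for name, cost_per_interaction in [("Model A", 0.006), ("Model B", 0.021)]:
    monthly = cost_per_interaction * INTERACTIONS_PER_DAY * DAYS_PER_MONTH
    print(f"{name}: ${monthly:,.0f}/month")

# Model A: $16,200/month  -> within the $40k incremental-spend cap
# Model B: $56,700/month  -> ~40% over the cap
```

Under the same assumptions, the $0.015 per-interaction target works out to roughly $40,500/month, i.e. it is essentially the $40k cap expressed per interaction.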
The Problem
Leadership wants to launch auto-send for Low-risk tickets next month and expand to Medium-risk next quarter. However, the metrics show conflicting tradeoffs: Model B reduces hallucinations but violates latency and cost constraints; Model A meets cost/latency but fails safety targets—especially on high-risk tickets.
Your Task
- Define hallucination rate precisely for this product (what counts as a hallucination vs. acceptable paraphrase vs. missing citation). Propose at least two complementary hallucination metrics (e.g., claim-level vs. answer-level).
- Design an evaluation approach using LangSmith or a custom harness (a minimal harness sketch follows this list) that measures, per model/config:
- hallucination rate (overall + by risk tier + by vertical)
- latency distribution (p50/p95) broken down by pipeline stage (retrieval, rerank, generation)
- cost (token-based + fixed overhead) and variance across ticket types
- Given the table above, diagnose the most likely root causes of:
- high hallucination in high-risk tickets
- poor citation coverage
- p95 latency blowups
- Propose a release decision (ship Model A, ship Model B, hybrid, or delay) and justify it with a clear metric-driven argument.
- Propose a plan to validate in production (online evaluation): what to log, how to sample, what human review is needed, and what guardrails/rollback criteria you’d set.
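For the custom-harness path referenced above, one minimal sketch of the per-ticket loop: time each pipeline stage, record token usage and cost, and roll up hallucination rate, citation coverage, latency percentiles, and cost by risk tier. The `pipeline.retrieve/rerank/generate` methods, the `judge` callable (an LLM-as-judge or a lookup of human labels), and the usage/verdict field names are all placeholder assumptions; in LangSmith, roughly the same stage timings would come from child runs on a trace and the judgments from evaluator feedback.

```python
import statistics
import time
from collections import defaultdict


def run_ticket(ticket, pipeline, judge, price_per_1k_prompt, price_per_1k_completion):
    """Run one eval ticket through the pipeline and return a metrics row."""
    row = {"ticket_id": ticket.ticket_id, "risk": str(ticket.risk), "vertical": ticket.vertical}

    t0 = time.perf_counter()
    chunks = pipeline.retrieve(ticket.user_message)
    row["t_retrieval_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    chunks = pipeline.rerank(ticket.user_message, chunks)
    row["t_rerank_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    answer, usage = pipeline.generate(ticket.user_message, chunks)  # usage = token counts
    row["t_generation_s"] = time.perf_counter() - t0
    row["t_total_s"] = row["t_retrieval_s"] + row["t_rerank_s"] + row["t_generation_s"]

    row["cost_usd"] = (usage["prompt_tokens"] / 1000 * price_per_1k_prompt
                       + usage["completion_tokens"] / 1000 * price_per_1k_completion)

    verdict = judge(answer, ticket.gold_answer, ticket.gold_kb_article_ids, chunks)
    row["hallucinated"] = verdict["any_unsupported_claim"]            # answer-level metric
    row["claim_halluc_frac"] = verdict["unsupported_claim_fraction"]  # claim-level metric
    row["has_valid_citation"] = verdict["has_valid_citation"]
    return row


def summarize(rows):
    """Roll up hallucination rate, citation coverage, latency, and cost by risk tier."""
    by_tier = defaultdict(list)
    for r in rows:
        by_tier["overall"].append(r)
        by_tier[r["risk"]].append(r)

    summary = {}
    for tier, group in by_tier.items():
        latencies = sorted(r["t_total_s"] for r in group)
        if len(latencies) > 1:
            pct = statistics.quantiles(latencies, n=100)
        else:
            pct = latencies * 99  # degenerate case: a single sample
        summary[tier] = {
            "n": len(group),
            "hallucination_rate": sum(r["hallucinated"] for r in group) / len(group),
            "citation_coverage": sum(r["has_valid_citation"] for r in group) / len(group),
            "p50_latency_s": pct[49],
            "p95_latency_s": pct[94],
            "avg_cost_usd": statistics.mean(r["cost_usd"] for r in group),
        }
    return summary
```

The same grouping keyed on `vertical` gives the per-vertical breakdown, and the per-stage timings support the latency-by-stage requirement.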
Constraints
- Auto-send is allowed only when the system is high confidence and citations are present; otherwise it must be draft-only (a minimal gating sketch follows this list).
- Human labeling budget is 200 tickets/week for deep review.
- Some customers (healthcare SaaS) require auditability: you must store traces and citations for 30 days.
- You cannot exceed $40k/month incremental LLM spend at current volume (~90k interactions/day).
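Tied to the first constraint above, a minimal sketch of the auto-send gate; the confidence score, its threshold, and the Low-risk-only restriction for the initial launch are assumptions made explicit here, not a spec:

```python
def route_reply(risk: str, confidence: float, citation_ids: list, threshold: float = 0.9) -> str:
    """Return "auto_send" or "draft_only" for a generated reply.

    Auto-send only for Low-risk tickets when the model is high-confidence and
    at least one valid KB citation is attached; everything else goes to a human
    agent as a draft. The 0.9 threshold is an assumed placeholder to be
    calibrated against offline precision on Low-risk tickets.
    """
    if risk == "low" and confidence >= threshold and citation_ids:
        return "auto_send"
    return "draft_only"
```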