Context
You’re the ML lead for HelpHub, a B2B customer support platform used by ~1,200 SaaS companies. HelpHub is rolling out an LLM-powered support agent that drafts responses for human agents and, for low-risk tickets, can auto-send replies. The agent uses RAG over each customer’s private knowledge base (KB): product docs, policy pages, and past resolved tickets. HelpHub serves ~2.5M end-user tickets/month and ~90k agent-assist interactions/day.
The business stakes are high:
- Incorrect policy guidance (refunds, cancellations, data retention) can create contractual liability and churn.
- Slow responses hurt agent productivity; leadership has set a p95 latency SLO.
- LLM spend is now one of the fastest-growing COGS lines; Finance requires per-interaction cost reporting and guardrails.
You have two candidate configurations (sketched as harness configs below):
- Model A (Cheaper/Faster): smaller model + 3 retrieved chunks
- Model B (Safer/Slower): larger model + 6 retrieved chunks + stricter system prompt
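Sketched as hypothetical eval-harness configs (the model identifiers and parameter keys are illustrative assumptions, not an existing HelpHub schema):

```python
# Two candidate configurations for the offline comparison. Values mirror the
# descriptions above; model names are placeholders.
CANDIDATES = {
    "model_a": {"model": "small-fast-llm", "retrieved_chunks": 3, "system_prompt": "baseline"},
    "model_b": {"model": "large-safe-llm", "retrieved_chunks": 6, "system_prompt": "strict"},
}
```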
You can use LangSmith (traces, datasets, evaluators) or build a custom harness. You have a labeled evaluation set of 2,000 historical tickets across 8 verticals (fintech, e-commerce, healthcare SaaS, HR, etc.). For each ticket, you have: (1) the user message, (2) the correct KB articles that should be cited, (3) a reference “gold” answer written by a senior support agent, and (4) a risk label: Low / Medium / High.
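In a custom harness, each labeled ticket could be represented with a record like the one below; field names are an illustrative assumption, not an existing HelpHub format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Risk(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class EvalTicket:
    """One labeled ticket from the 2,000-ticket offline eval set."""
    ticket_id: str
    vertical: str                    # e.g. "fintech", "healthcare_saas"
    user_message: str                # (1) the end user's message
    gold_kb_article_ids: List[str]   # (2) KB articles that should be cited
    gold_answer: str                 # (3) reference answer from a senior agent
    risk: Risk                       # (4) Low / Medium / High
```

The same fields map naturally onto a LangSmith dataset example (inputs: user message; reference outputs: gold answer and citations; metadata: vertical and risk) if you take that route instead of a custom harness.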
Current Offline Results (2,000-ticket eval)
| Metric | Model A | Model B | Target / Constraint |
|---|---|---|---|
| Hallucination rate (overall) | 3.8% | 1.9% | ≤ 2.0% overall |
| Hallucination rate (High risk) | 7.4% | 3.1% | ≤ 1.0% high-risk (auto-send) |
| Citation coverage (has ≥1 valid KB citation) | 86% | 93% | ≥ 95% |
| Answer acceptance by agents (offline proxy) | 62% | 69% | ≥ 70% |
| p50 latency (end-to-end) | 1.6s | 2.4s | p50 ≤ 2.0s |
| p95 latency (end-to-end) | 4.9s | 7.8s | p95 ≤ 6.0s |
| Avg prompt tokens | 1,150 | 1,980 | — |
| Avg completion tokens | 260 | 320 | — |
| Estimated cost / interaction | $0.006 | $0.021 | ≤ $0.015 |
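Before diagnosing anything else, a back-of-the-envelope check of the cost row against the budget cap in the Constraints below (assuming every one of the ~90k daily agent-assist interactions makes one model call and a 30-day month, both simplifications):

```python
# Rough monthly LLM spend at current agent-assist volume.
INTERACTIONS_PER_DAY = 90_000
DAYS_PER_MONTH = 30

for name, cost_per_interaction in [("Model A", 0.006), ("Model B", 0.021)]:
    monthly = cost_per_interaction * INTERACTIONS_PER_DAY * DAYS_PER_MONTH
    print(f"{name}: ${monthly:,.0f}/month")

# Model A: $16,200/month  -> within the $40k incremental-spend cap
# Model B: $56,700/month  -> ~40% over the cap
```

Under the same assumptions, the $0.015 per-interaction target works out to roughly $40,500/month, i.e. it is essentially the $40k cap expressed per interaction.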
The Problem
Leadership wants to launch auto-send for Low-risk tickets next month and expand to Medium-risk next quarter. However, the metrics show conflicting tradeoffs: Model B reduces hallucinations but violates latency and cost constraints; Model A meets cost/latency but fails safety targets—especially on high-risk tickets.
Your Task
- Define hallucination rate precisely for this product (what counts as a hallucination vs. acceptable paraphrase vs. missing citation). Propose at least two complementary hallucination metrics (e.g., claim-level vs. answer-level).
- Design an evaluation approach using LangSmith or a custom harness (a minimal harness sketch follows this list) that measures, per model/config:
- hallucination rate (overall + by risk tier + by vertical)
- latency distribution (p50/p95) broken down by pipeline stage (retrieval, rerank, generation)
- cost (token-based + fixed overhead) and variance across ticket types
- Given the table above, diagnose the most likely root causes of:
- high hallucination in high-risk tickets
- poor citation coverage
- p95 latency blowups
- Propose a release decision (ship Model A, ship Model B, hybrid, or delay) and justify it with a clear metric-driven argument.
- Propose a plan to validate in production (online evaluation): what to log, how to sample, what human review is needed, and what guardrails/rollback criteria you’d set.
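For the custom-harness path referenced above, one minimal sketch of the per-ticket loop: time each pipeline stage, record token usage and cost, and roll up hallucination rate, citation coverage, latency percentiles, and cost by risk tier. The `pipeline.retrieve/rerank/generate` methods, the `judge` callable (an LLM-as-judge or a lookup of human labels), and the usage/verdict field names are all placeholder assumptions; in LangSmith, roughly the same stage timings would come from child runs on a trace and the judgments from evaluator feedback.

```python
import statistics
import time
from collections import defaultdict


def run_ticket(ticket, pipeline, judge, price_per_1k_prompt, price_per_1k_completion):
    """Run one eval ticket through the pipeline and return a metrics row."""
    row = {"ticket_id": ticket.ticket_id, "risk": str(ticket.risk), "vertical": ticket.vertical}

    t0 = time.perf_counter()
    chunks = pipeline.retrieve(ticket.user_message)
    row["t_retrieval_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    chunks = pipeline.rerank(ticket.user_message, chunks)
    row["t_rerank_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    answer, usage = pipeline.generate(ticket.user_message, chunks)  # usage = token counts
    row["t_generation_s"] = time.perf_counter() - t0
    row["t_total_s"] = row["t_retrieval_s"] + row["t_rerank_s"] + row["t_generation_s"]

    row["cost_usd"] = (usage["prompt_tokens"] / 1000 * price_per_1k_prompt
                       + usage["completion_tokens"] / 1000 * price_per_1k_completion)

    verdict = judge(answer, ticket.gold_answer, ticket.gold_kb_article_ids, chunks)
    row["hallucinated"] = verdict["any_unsupported_claim"]            # answer-level metric
    row["claim_halluc_frac"] = verdict["unsupported_claim_fraction"]  # claim-level metric
    row["has_valid_citation"] = verdict["has_valid_citation"]
    return row


def summarize(rows):
    """Roll up hallucination rate, citation coverage, latency, and cost by risk tier."""
    by_tier = defaultdict(list)
    for r in rows:
        by_tier["overall"].append(r)
        by_tier[r["risk"]].append(r)

    summary = {}
    for tier, group in by_tier.items():
        latencies = sorted(r["t_total_s"] for r in group)
        if len(latencies) > 1:
            pct = statistics.quantiles(latencies, n=100)
        else:
            pct = latencies * 99  # degenerate case: a single sample
        summary[tier] = {
            "n": len(group),
            "hallucination_rate": sum(r["hallucinated"] for r in group) / len(group),
            "citation_coverage": sum(r["has_valid_citation"] for r in group) / len(group),
            "p50_latency_s": pct[49],
            "p95_latency_s": pct[94],
            "avg_cost_usd": statistics.mean(r["cost_usd"] for r in group),
        }
    return summary
```

The same grouping keyed on `vertical` gives the per-vertical breakdown, and the per-stage timings support the latency-by-stage requirement.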
Constraints
- Auto-send is allowed only when the system is high confidence and citations are present; otherwise it must be draft-only (a minimal gating sketch follows this list).
- Human labeling budget is 200 tickets/week for deep review.
- Some customers (healthcare SaaS) require auditability: you must store traces and citations for 30 days.
- You cannot exceed $40k/month incremental LLM spend at current volume (~90k interactions/day).
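Tied to the first constraint above, a minimal sketch of the auto-send gate; the confidence score, its threshold, and the Low-risk-only restriction for the initial launch are assumptions made explicit here, not a spec:

```python
def route_reply(risk: str, confidence: float, citation_ids: list, threshold: float = 0.9) -> str:
    """Return "auto_send" or "draft_only" for a generated reply.

    Auto-send only for Low-risk tickets when the model is high-confidence and
    at least one valid KB citation is attached; everything else goes to a human
    agent as a draft. The 0.9 threshold is an assumed placeholder to be
    calibrated against offline precision on Low-risk tickets.
    """
    if risk == "low" and confidence >= threshold and citation_ids:
        return "auto_send"
    return "draft_only"
```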