Evaluate a Grounded Support Assistant

Scenario

You are responsible for evaluating an LLM assistant that drafts responses for operations and support teams using internal policy documents, underwriting guidelines, and customer account notes. The assistant is already live in a limited rollout and handles several thousand queries per day, but stakeholders do not trust the current quality signals because they rely mostly on thumbs-up feedback. You need an evaluation framework that measures answer quality, groundedness, and safety before the team expands usage. The system may answer directly, ask a clarifying question, or refuse when the source material is insufficient.

Constraints

p95 end-to-end latency for the assistant must stay under 2,500ms
Evaluation should support a cost ceiling of $12K/month at current traffic
Hallucinated factual claims must stay below 2% on a reviewed golden set
Prompt injection from retrieved documents or user messages must be treated as a real production risk
Responses must not expose sensitive customer or financial data beyond the user’s access scope
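
The three numeric constraints above can be expressed as a single pass/fail release gate. The following is a minimal sketch, not a prescribed implementation: the names `p95`, `GateReport`, and `evaluate_gates` are hypothetical, the hallucination rate is read as hallucinated claims divided by all reviewed claims (one possible interpretation of the constraint), and the per-query cost and daily traffic are illustrative placeholders rather than figures from the scenario.

```python
from dataclasses import dataclass

# Thresholds copied from the Constraints list.
P95_LATENCY_LIMIT_MS = 2500
MONTHLY_COST_CEILING_USD = 12_000
HALLUCINATION_RATE_LIMIT = 0.02


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]


@dataclass
class GateReport:
    p95_latency_ms: float
    hallucination_rate: float
    projected_monthly_cost_usd: float

    @property
    def passed(self):
        # All three constraints must hold for the gate to pass.
        return (
            self.p95_latency_ms < P95_LATENCY_LIMIT_MS
            and self.hallucination_rate < HALLUCINATION_RATE_LIMIT
            and self.projected_monthly_cost_usd <= MONTHLY_COST_CEILING_USD
        )


def evaluate_gates(latencies_ms, hallucinated_claims, total_claims,
                   cost_per_query_usd, queries_per_day, days_per_month=30):
    """Combine latency samples, golden-set review counts, and cost inputs."""
    rate = hallucinated_claims / total_claims
    cost = cost_per_query_usd * queries_per_day * days_per_month
    return GateReport(p95(latencies_ms), rate, cost)


# Illustrative inputs only (assumed values, not data from the pilot).
report = evaluate_gates(
    latencies_ms=list(range(100, 2100, 100)),  # 20 samples: 100..2000 ms
    hallucinated_claims=3,
    total_claims=200,
    cost_per_query_usd=0.05,
    queries_per_day=5000,
)
print(report, report.passed)
```

The prompt-injection and data-exposure constraints are not numeric thresholds in the scenario, so they are left out of this gate; they would need their own test sets and pass criteria.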

Available Resources

Historical prompts, retrieved context, model outputs, and agent traces from the pilot
Internal policy and account-document corpus with citation metadata
A GPT-4-class model and a smaller low-cost model for automated grading
20 hours per week of expert reviewer time for labeling and adjudication
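
The reviewer budget bounds how quickly a golden set can be built or refreshed. As a rough sketch, the arithmetic below converts weekly reviewer hours into labeled items; the function name `weekly_label_capacity`, the 4 minutes per item, and the 25% double-review share are assumptions chosen for illustration, not values given in the scenario.

```python
def weekly_label_capacity(reviewer_hours_per_week, minutes_per_item,
                          double_review_fraction=0.0):
    """Items labeled per week when a fraction of items gets a second review pass."""
    total_minutes = reviewer_hours_per_week * 60
    # Each item costs one pass, plus an extra pass for the double-reviewed share.
    minutes_per_labeled_item = minutes_per_item * (1 + double_review_fraction)
    return int(total_minutes // minutes_per_labeled_item)


# 20 hours/week comes from the resources list; the other inputs are assumed.
capacity = weekly_label_capacity(20, minutes_per_item=4, double_review_fraction=0.25)
print(capacity)
```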

Question

How would you design a practical evaluation strategy for this LLM system so the team can measure response quality reliably and decide whether it is safe to scale? Explain how you would balance offline and online evaluation, define success metrics, and account for hallucination, prompt injection, latency, and cost in the final approach.
