You are responsible for evaluating an LLM assistant that drafts responses for operations and support teams using internal policy documents, underwriting guidelines, and customer account notes. The assistant is already live in a limited rollout and handles several thousand queries per day, but stakeholders do not trust the current quality signals because those signals rely mostly on thumbs-up feedback. You need an evaluation framework that measures answer quality, groundedness, and safety before the team expands usage. The system may answer directly, ask a clarifying question, or refuse when the source material is insufficient.
How would you design a practical evaluation strategy for this LLM system so the team can measure response quality reliably and decide whether it is safe to scale? Explain how you would balance offline and online evaluation, define success metrics, and account for hallucination, prompt injection, latency, and cost in the final approach.
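To make the metrics named in the question concrete, here is a minimal sketch of an offline scorecard that a scale-up decision could consume. Everything in it is an assumption for illustration: the `EvalRecord` fields, the `summarize` and `failed_checks` helpers, and the threshold names are hypothetical, and the labels (`grounded`, `followed_injection`) are assumed to come from human review or a separately validated grader.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalRecord:
    """One labeled test case (hypothetical schema)."""
    expected_action: str          # "answer", "clarify", or "refuse"
    actual_action: str            # what the assistant actually did
    grounded: Optional[bool]      # reviewer label; None when no answer was given
    is_injection_case: bool       # True if the input carried an injected instruction
    followed_injection: bool      # True if the assistant obeyed that instruction
    latency_ms: float
    cost_usd: float

def rate(flags):
    """Fraction of True values, or None if there is nothing to score."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(records):
    """Aggregate per-case labels into the metrics the question asks about."""
    answers = [r for r in records if r.actual_action == "answer"]
    attacks = [r for r in records if r.is_injection_case]
    return {
        # Did the assistant pick the right move: answer, clarify, or refuse?
        "action_accuracy": rate(r.expected_action == r.actual_action for r in records),
        # Among direct answers, how many were fully supported by the sources?
        "groundedness_rate": rate(bool(r.grounded) for r in answers),
        # Among adversarial cases, how often did the injection succeed?
        "injection_success_rate": rate(r.followed_injection for r in attacks),
        "p95_latency_ms": percentile([r.latency_ms for r in records], 95),
        "mean_cost_usd": sum(r.cost_usd for r in records) / len(records),
    }

def failed_checks(summary, limits):
    """Return the names of metrics that miss their (assumed) thresholds.

    A metric that could not be computed (None) counts as a failure.
    """
    def at_least(key, floor):
        return summary[key] is not None and summary[key] >= floor

    def at_most(key, ceiling):
        return summary[key] is not None and summary[key] <= ceiling

    checks = {
        "action_accuracy": at_least("action_accuracy", limits["min_action_accuracy"]),
        "groundedness_rate": at_least("groundedness_rate", limits["min_groundedness"]),
        "injection_success_rate": at_most("injection_success_rate", limits["max_injection_success"]),
        "p95_latency_ms": at_most("p95_latency_ms", limits["max_p95_latency_ms"]),
        "mean_cost_usd": at_most("mean_cost_usd", limits["max_mean_cost_usd"]),
    }
    return [name for name, ok in checks.items() if not ok]
```

Scoring groundedness only over direct answers and tracking action accuracy separately keeps a model from improving one number by refusing everything. The threshold values themselves are a product and risk decision, not something this sketch can supply.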