You are responsible for an internal assistant that answers analyst and engineer questions over a large corpus of operational documents, tickets, and runbooks. The system already uses retrieval-augmented generation and is being rolled out to a few hundred daily users, with plans to expand quickly. Early feedback shows the assistant is often helpful, but a small number of answers contain unsupported claims, and security reviewers are concerned about prompt injection hidden inside retrieved content. You need a practical evaluation and monitoring plan before broader launch.
How would you evaluate and monitor this LLM system for hallucinations, prompt injection, and other important failure modes while keeping latency and cost within budget? How would you use those results to decide when the system is safe to expand?