You are improving an LLM-powered assistant that answers questions about financing policies, underwriting guidelines, and customer account workflows for internal users. The current system produces fluent answers, but reviewers have found that some responses include unsupported claims or confidently fill in missing details. Usage is growing to a few thousand questions per day, and incorrect answers can create operational and compliance risk. You need to reduce hallucinations without making the product too slow or too expensive.
How would you redesign this application to reduce hallucinations in production while keeping latency and cost acceptable? Explain your approach to grounding, prompting, evaluation, and runtime safeguards, including how you would handle missing evidence and prompt-injection risk.