You are building an internal assistant that answers employee questions over policy manuals, delivery playbooks, controls documentation, and engagement guidance. The current prototype uses basic vector search plus a single LLM call, but users report slow responses, weak retrieval on acronym-heavy queries, and answers that sound plausible while citing irrelevant passages. The corpus contains about 1.2 million documents across PDF, HTML, and markdown, with frequent updates and uneven document quality. Leadership wants a production-ready RAG system that improves answer quality without materially increasing spend.
How would you improve this RAG system’s performance, and how would you evaluate whether retrieval, prompting, reranking, and model choices are actually moving quality in the right direction while staying within the latency, cost, and safety limits?