You are building an internal search-and-answer assistant for a government organization whose staff need to query thousands of policy memos, directives, manuals, and procedural updates stored as PDFs, Word files, and scanned documents. The corpus contains long documents with section hierarchies, tables, footnotes, versioned revisions, and citations to statutes or prior guidance, and some records are noisy because they come from OCR. Users ask natural-language questions about topics such as eligibility rules, approval workflows, and exceptions, and they need grounded answers that cite the exact source passages rather than free-form summaries.
How would you design and implement a retrieval-augmented generation (RAG) system for this use case, covering document preprocessing, retrieval, answer generation, and evaluation, so that responses stay accurate, traceable, and useful in production?