Scenario
You are building a retrieval-backed assistant for an internal engineering org that needs to answer questions over design docs, runbooks, tickets, and wiki pages. The current keyword search misses synonyms, acronyms, and cross-document context, so users often open several results before finding an answer. The corpus is about 2 million documents with frequent updates, and the product is expected to serve both direct search and grounded LLM answers.
Constraints
- p95 end-to-end latency must stay under 2,500 ms for answer generation and under 500 ms for search-only queries
- Monthly serving budget must stay under $30K at 80K queries/day
- The rate of hallucinated or unsupported factual claims must stay below 2% on a labeled evaluation set
- Retrieved content may contain prompt injection attempts, stale guidance, and access-controlled material
- Every generated answer must include source citations or refuse when evidence is insufficient
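To make the cost constraint concrete, the budget and traffic figures can be turned into a rough per-query allowance. The sketch below assumes a 30-day month and treats every query as equally expensive; neither assumption comes from the scenario, and the function name is only illustrative.

```python
# Rough per-query budget implied by the cost constraint.
# Assumption: a 30-day month; the scenario does not define the billing period.
DAYS_PER_MONTH = 30
MONTHLY_BUDGET_USD = 30_000   # "$30K" from the constraints
QUERIES_PER_DAY = 80_000      # "80K queries/day" from the constraints


def per_query_budget(monthly_budget_usd, queries_per_day, days_per_month=DAYS_PER_MONTH):
    """Return the average spend allowed per query, in USD."""
    monthly_queries = queries_per_day * days_per_month
    return monthly_budget_usd / monthly_queries


budget = per_query_budget(MONTHLY_BUDGET_USD, QUERIES_PER_DAY)
print(f"Monthly queries: {QUERIES_PER_DAY * DAYS_PER_MONTH:,}")
print(f"Average budget per query: ${budget:.4f}")
```

Under these assumptions the allowance is about 1.25 cents per query on average, covering embedding, retrieval, reranking, and generation together. Search-only queries that skip the LLM call would leave more headroom for answer-generation queries.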
Available Resources
- A corpus of 2 million internal documents with metadata, ACLs, and update timestamps
- Access to an approved embedding model, a hosted LLM API, and a vector store that supports metadata filtering
- A BM25 index and a lightweight reranker service already used by search infrastructure
- Capacity to label 1,000 evaluation queries and run online experiments with a small internal user group
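The 1,000-query labeling capacity also bounds how precisely the 2% hallucination target can be verified. As a minimal sketch, assuming each labeled query counts as one pass/fail trial (the scenario does not say whether the rate is per query or per claim), a Wilson score interval shows the uncertainty around an observed failure rate. The function name and the 95% confidence level are illustrative choices, not part of the scenario.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    failures: number of evaluation queries judged hallucinated/unsupported
    n:        total number of labeled evaluation queries
    z:        normal quantile (1.96 is approximately a 95% interval)
    """
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: 20 failures out of 1,000 labeled queries (exactly 2%).
low, high = wilson_interval(20, 1000)
print(f"Observed 2.0%, approx. 95% interval: {low:.1%} to {high:.1%}")
```

With 20 failures in 1,000 queries the interval spans roughly 1.3% to 3.1%, so an observed rate right at the threshold would not by itself show that the true rate is below 2%. A design that takes the constraint seriously would probably aim for an observed rate comfortably under the target.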
Question
How would you design the retrieval system and surrounding RAG pipeline so that semantic search quality improves meaningfully over keyword search while meeting the latency, cost, and safety requirements? Explain the main design choices you would make and how you would evaluate, monitor, and harden the system against hallucination, prompt injection, and stale or unauthorized retrievals.