Improve RAG Answer Quality

Scenario

You are building an internal assistant that answers employee questions over policy manuals, delivery playbooks, controls documentation, and engagement guidance. The current prototype uses basic vector search plus a single LLM call, but users report slow responses, weak retrieval on acronym-heavy queries, and answers that sound plausible while citing irrelevant passages. The corpus contains about 1.2 million documents across PDF, HTML, and markdown, with frequent updates and uneven document quality. Leadership wants a production-ready RAG system that improves answer quality without materially increasing spend.

Constraints

p95 latency must stay under 2,500 ms end-to-end
Cost ceiling: $0.035 per request and $45K/month at projected volume
Hallucination or unsupported-claim rate must be below 2% on a held-out golden set
Every factual answer must include grounded citations
The system must resist prompt injection in retrieved content and avoid leaking restricted content
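
The three numeric constraints above can be turned into an automated gate over logged requests. The following is a minimal sketch, not a prescribed implementation: the `RequestLog` record and its field names are hypothetical, the per-request cost ceiling is assumed to apply to the mean cost, and the percentile uses the nearest-rank method. The last line works out the monthly volume implied by the two cost figures, since the scenario does not state the projected volume.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """Hypothetical per-request record; field names are illustrative."""
    latency_ms: float
    cost_usd: float
    unsupported: bool  # True if a grader flagged an unsupported claim


def p95(values):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_constraints(logs):
    """Compare a batch of logged requests against the stated limits."""
    n = len(logs)
    return {
        "p95_latency_ok": p95([r.latency_ms for r in logs]) < 2500,
        # Assumption: the $0.035 ceiling is read as a mean, not a per-call max.
        "cost_per_request_ok": sum(r.cost_usd for r in logs) / n <= 0.035,
        "unsupported_rate_ok": sum(r.unsupported for r in logs) / n < 0.02,
    }


# Volume at which spending the full $0.035 on every request would
# exhaust the $45K monthly budget (about 1.29 million requests).
MAX_MONTHLY_REQUESTS_AT_CEILING = int(45_000 / 0.035)
```

Running `check_constraints` on both offline eval runs and sampled production traffic keeps the quality, latency, and cost limits in one report, so a change that helps one metric while breaking another is visible immediately.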

Available Resources

Approved GPT-4-class and smaller low-cost models, plus embedding models
Hybrid search infrastructure with BM25 and vector retrieval
Document metadata including access controls, timestamps, and business unit tags
Capacity to label 800 evaluation questions and run weekly offline evals
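
One way to combine the BM25 and vector resources with the access-control metadata is to filter each ranked list by the user's permissions and then merge the lists with reciprocal rank fusion (RRF). The sketch below is one possible approach, not something the scenario mandates: the `acl` metadata field, the function names, and the constant `k=60` (a commonly used default for RRF) are assumptions made for illustration. Inputs are plain lists of document IDs, best match first.

```python
from collections import defaultdict


def allowed(doc_meta, user_groups):
    """True if the user shares at least one group with the document ACL."""
    return bool(set(doc_meta["acl"]) & set(user_groups))


def rrf_fuse(bm25_ranked, vector_ranked, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(bm25_ranked, vector_ranked, metadata, user_groups, top_n=5):
    """Drop documents the user may not see, then fuse the two rankings."""
    def visible(ids):
        return [d for d in ids if allowed(metadata[d], user_groups)]

    return rrf_fuse(visible(bm25_ranked), visible(vector_ranked))[:top_n]
```

Filtering before fusion means restricted documents never reach the reranker or the prompt, which addresses the leakage constraint at the retrieval layer rather than relying on the model to withhold content.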

Question

How would you improve this RAG system’s performance, and how would you evaluate whether retrieval, prompting, reranking, and model choices are actually moving quality in the right direction while staying within the latency, cost, and safety limits?
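
When reasoning about the evaluation half of the question, it helps to quantify how much an 800-question golden set can actually prove about a 2% threshold. The sketch below computes a Wilson score interval for an observed unsupported-claim rate; the choice of the Wilson interval and the 95% level (`z=1.96`) are assumptions for illustration, not requirements from the scenario.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def rate_is_demonstrably_below(failures, n, threshold=0.02):
    """True only if the interval's upper bound sits under the threshold."""
    _, upper = wilson_interval(failures, n)
    return upper < threshold
```

An observed rate of exactly 2% (16 failures in 800) cannot by itself show that the true rate is below 2%, because the upper bound of the interval lies above the threshold. The observed rate has to be comfortably lower before the constraint can be claimed with confidence, which is a reason to track the interval and not only the point estimate in weekly evals.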
