You are asked to design a production retrieval-augmented generation (RAG) assistant used internally by support and engineering to answer questions over OpenAI API docs, changelogs, incident runbooks, and policy notes.

Constraints: the system must handle 120 QPS steady state and 400 QPS at peak; serve interactive chat with P95 end-to-end latency under 900 ms for retrieval plus first token; and stay under an average infrastructure cost of $0.012 per request, excluding model-token charges. Assume a corpus of 80 million chunks with daily updates, a hard cap of 12 GPU inference nodes, and a vector-store budget that fits in 2 TB of RAM-equivalent storage.

Walk through ingestion, chunking, embedding generation, vector search, reranking, prompt construction, caching, and fallback behavior when retrieval is stale or low-confidence, and explain how you would choose embeddings, ANN indexing, and metadata filtering to meet the SLOs.

Then describe an evaluation plan covering both retrieval quality and answer quality, including offline metrics, online experiments, hallucination detection, and how you would monitor for regressions after model or embedding upgrades.
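To make the pipeline stages concrete, here is a minimal toy sketch of the ingest, embed, filtered-search, cache, and low-confidence-fallback flow the prompt asks about. All names, the bag-of-words "embedding", and the 0.25 confidence threshold are illustrative assumptions; a real answer would substitute a learned embedding model, an ANN index (e.g. HNSW or IVF-PQ), a reranker, and a shared cache.

```python
import math

class RetrievalPipeline:
    """Toy sketch of ingest -> embed -> filtered search -> cache with a
    low-confidence fallback. Names and thresholds are illustrative only."""

    def __init__(self, confidence_threshold=0.25):
        self.vocab = {}   # token -> dimension index (stand-in for an embedding model)
        self.docs = []    # (doc_id, text, metadata, sparse_unit_vector)
        self.cache = {}   # (query, filter) -> result; stand-in for a shared cache
        self.threshold = confidence_threshold

    def _embed(self, text):
        """Bag-of-words unit vector; a real system would call an embedding model."""
        counts = {}
        for tok in text.lower().split():
            idx = self.vocab.setdefault(tok, len(self.vocab))
            counts[idx] = counts.get(idx, 0.0) + 1.0
        norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
        return {i: v / norm for i, v in counts.items()}

    @staticmethod
    def _cosine(a, b):
        # Sparse dot product; missing dimensions are implicit zeros.
        return sum(v * b.get(i, 0.0) for i, v in a.items())

    def ingest(self, doc_id, text, metadata):
        self.docs.append((doc_id, text, metadata, self._embed(text)))

    def search(self, query, k=3, metadata_filter=None):
        cache_key = (query, tuple(sorted((metadata_filter or {}).items())))
        if cache_key in self.cache:
            return self.cache[cache_key]
        qv = self._embed(query)
        # Metadata filtering before scoring (a real ANN index would push this down).
        candidates = [
            (self._cosine(qv, vec), doc_id, text)
            for doc_id, text, meta, vec in self.docs
            if metadata_filter is None
            or all(meta.get(f) == v for f, v in metadata_filter.items())
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        top = candidates[:k]
        # Low-confidence fallback: below threshold, tell the caller to degrade
        # gracefully (answer without retrieved context, or escalate to a human).
        if not top or top[0][0] < self.threshold:
            result = {"status": "low_confidence", "hits": []}
        else:
            result = {"status": "ok", "hits": top}
        self.cache[cache_key] = result
        return result


rag = RetrievalPipeline()
rag.ingest("doc-1", "rate limits for the chat completions endpoint", {"source": "docs"})
rag.ingest("rb-1", "rotate api keys after an incident", {"source": "runbook"})
hit = rag.search("how do rate limits work", metadata_filter={"source": "docs"})
# hit["status"] == "ok"; top hit is "doc-1"
miss = rag.search("tell me about zebras")
# miss["status"] == "low_confidence"
```

The same object structure makes the fallback explicit to the prompt-construction layer: on "low_confidence" it can omit context and instruct the model to say it does not know, which is one concrete answer to the staleness/low-confidence part of the question.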
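For the offline-metrics part of the evaluation plan, two standard retrieval measures are recall@k and mean reciprocal rank (MRR), computed against a labeled set of query-to-relevant-chunk judgments. This is a small illustrative sketch; the function names and the data shapes (ranked ID lists per query, sets of relevant IDs) are my own assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mean_reciprocal_rank(ranked_runs, relevant_sets):
    """Average of 1/rank of the first relevant hit, over all queries;
    a query with no relevant hit contributes 0."""
    total = 0.0
    for ranked, relevant in zip(ranked_runs, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_runs)


runs = [["d3", "d1", "d7"], ["d9", "d2"]]   # system output, best first
gold = [{"d1"}, {"d5"}]                     # labeled relevant chunks per query
# recall_at_k(runs[0], gold[0], 2) -> 1.0
# mean_reciprocal_rank(runs, gold) -> 0.25  (first query: 1/2; second: 0)
```

Tracking these per corpus slice (docs vs. runbooks vs. policy notes) before and after each model or embedding upgrade gives a concrete regression signal to pair with online experiments.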