Interview Guides

Design Semantic Search for Internal Knowledge

Hard

Generative AI & LLMs

Scenario

You are building a retrieval-backed assistant for an internal engineering org that needs to answer questions over design docs, runbooks, tickets, and wiki pages. The current keyword search misses synonyms, acronyms, and cross-document context, so users often open several results before finding an answer. The corpus is about 2 million documents with frequent updates, and the product is expected to serve both direct search and grounded LLM answers.

Constraints

p95 end-to-end latency must stay under 2,500 ms for answer generation and under 500 ms for search-only queries
Monthly serving budget must stay under $30K at 80K queries/day
Hallucinated or unsupported factual claims must be below 2% on a labeled evaluation set
Retrieved content may contain prompt injection attempts, stale guidance, and access-controlled material
Every generated answer must include source citations or refuse when evidence is insufficient
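
To make the budget constraint concrete, the arithmetic can be sketched as below. This is a rough back-of-the-envelope calculation, not a cost model: the 30-day month is an assumption (the constraint does not define it), the function names are hypothetical, and the average load figure ignores peak-hour traffic.

```python
# Figures taken from the constraints above.
QUERIES_PER_DAY = 80_000
MONTHLY_BUDGET_USD = 30_000

# Assumption: a 30-day month. The constraints do not specify this.
DAYS_PER_MONTH = 30


def per_query_budget(monthly_budget_usd, queries_per_day, days_per_month=DAYS_PER_MONTH):
    """Average spend allowed per query, in USD, across all serving components."""
    monthly_queries = queries_per_day * days_per_month
    return monthly_budget_usd / monthly_queries


def average_qps(queries_per_day):
    """Mean queries per second over a full day (peak load will be higher)."""
    return queries_per_day / 86_400


if __name__ == "__main__":
    budget = per_query_budget(MONTHLY_BUDGET_USD, QUERIES_PER_DAY)
    print(f"per-query budget: ${budget:.4f}")  # per-query budget: $0.0125
    print(f"average load: {average_qps(QUERIES_PER_DAY):.2f} queries/second")
```

Under these assumptions the blended allowance is about 1.25 cents per query, covering embedding, retrieval, reranking, and LLM generation together. Search-only queries that skip generation leave more headroom for the queries that need a grounded answer.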

Available Resources

A corpus of 2 million internal documents with metadata, ACLs, and update timestamps
Access to an approved embedding model, a hosted LLM API, and a vector store that supports metadata filtering
A BM25 index and a lightweight reranker service already used by search infrastructure
Capacity to label 1,000 evaluation queries and run online experiments with a small internal user group
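
The labeling capacity interacts with the 2% hallucination target: with only 1,000 labeled queries, the measured rate carries sampling uncertainty. The sketch below computes a 95% Wilson score interval for an observed failure count. It assumes one pass/fail judgment per evaluation query, which is a simplification since the constraint is phrased per claim, and the function name is hypothetical.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = failures / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumption: each of the 1,000 labeled queries yields one pass/fail label.
    n = 1_000
    for failures in (10, 20):
        low, high = wilson_interval(failures, n)
        print(f"{failures}/{n} unsupported: 95% interval {low:.3%} to {high:.3%}")
```

Under this assumption, observing exactly 20 unsupported answers out of 1,000 gives an interval whose upper bound is above 2%, so a point estimate at the threshold does not by itself show the target is met. The observed rate would need to sit clearly below 2%, or the evaluation set would need to grow, to support that claim.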

Question

How would you design the retrieval system and surrounding RAG pipeline so that semantic search quality improves meaningfully over keyword search while meeting the latency, cost, and safety requirements? Explain the main design choices you would make and how you would evaluate, monitor, and harden the system against hallucination, prompt injection, and stale or unauthorized retrievals.
