Context
FinSure offers an internal support copilot that answers customer-service questions using policy manuals, billing procedures, and compliance FAQs. A large enterprise customer reports that answer quality is inconsistent, but it is unclear whether the root cause lies in the prompt design, the retrieval pipeline, or the underlying model.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $0.03 per request, i.e. $18K/month at 20K requests/day
- Hallucination rate: <2% on a labeled golden set
- Prompt injection success rate: <1% on adversarial tests
- The system must cite retrieved sources for factual claims and refuse when evidence is insufficient
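
The two cost figures are internally consistent: $0.03 × 20K requests/day × 30 days = $18K/month. A minimal sketch of how a diagnostic harness might encode these constraints as data, so every experiment cell can be gated against the same budget; `SLOBudget` and `meets_slo` are illustrative names, not part of FinSure's stack:

```python
# Sketch: the constraints above as data, so ablation results can be
# gated uniformly. All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class SLOBudget:
    p95_latency_ms: float = 2_500.0
    cost_per_request_usd: float = 0.03
    monthly_cost_usd: float = 18_000.0
    requests_per_day: int = 20_000
    max_hallucination_rate: float = 0.02      # <2% on the golden set
    max_injection_success_rate: float = 0.01  # <1% on adversarial tests

def meets_slo(p95_ms: float, cost_usd: float, halluc_rate: float,
              inject_rate: float, b: SLOBudget = SLOBudget()) -> bool:
    """True iff a measured configuration satisfies all four constraints."""
    return (p95_ms <= b.p95_latency_ms
            and cost_usd <= b.cost_per_request_usd
            and halluc_rate < b.max_hallucination_rate
            and inject_rate < b.max_injection_success_rate)

# Sanity check: $0.03/request * 20K requests/day * 30 days = $18K/month.
budget = SLOBudget()
assert abs(budget.cost_per_request_usd * budget.requests_per_day * 30
           - budget.monthly_cost_usd) < 1e-6
```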
Available Resources
- 120K internal documents (PDFs, HTML help center pages, policy docs, ticket macros)
- Existing hybrid search stack (BM25 + dense vector search) and document metadata; a score-fusion sketch follows this list
- Three approved models: a small, fast model, a mid-tier model, and a premium model
- 400 historical customer questions with human-rated answers, plus 50 known-bad examples from the customer
- Access to prompt templates, retrieval logs, ranked results, and model outputs
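
The hybrid stack is the natural seam for retrieval-only ablations: run the same questions against BM25-only, dense-only, and fused rankings. A minimal sketch of one standard fusion method, reciprocal rank fusion (RRF); it assumes each backend returns document IDs ranked best-first, and `k=60` is the conventional RRF smoothing constant rather than anything FinSure-specific:

```python
# Sketch of reciprocal rank fusion (RRF) over the two retrieval backends.
# Assumes ranked lists of document IDs, best-first; names are illustrative.
from collections import defaultdict

def rrf_fuse(bm25_ids: list[str], dense_ids: list[str],
             k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse two rankings; each document scores sum(1 / (k + rank))."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

If answer quality tracks one backend but not the other, or the fused ranking alone, that localizes the failure to retrieval rather than the prompt or model.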
Task
- Propose a step-by-step diagnosis plan that isolates whether failures are caused by prompt design, retrieval quality, or model capability.
- Define an offline and online evaluation strategy, including how you would build a golden set, measure groundedness, and detect prompt injection or unsupported answers (a groundedness-check sketch follows this list).
- Recommend an architecture and experimentation plan to test prompt-only, retrieval-only, and model-only changes while respecting latency and cost constraints.
- Write a production-quality prompt and Python implementation for a diagnostic harness that runs ablations and returns structured root-cause signals (a skeleton of the ablation loop follows this list).
- List the most important failure modes, mitigations, and tradeoffs you would communicate to the customer and internal stakeholders.
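
For the evaluation task, a minimal sketch of a groundedness check that flags unsupported sentences. The `[doc_id]` citation format, the regex, and the sentence splitter are assumptions for illustration, not FinSure's real conventions:

```python
# Sketch: per-sentence groundedness against the retrieved set. Citation
# format "[doc_id]" and the sentence splitter are assumptions.
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")  # e.g. "[policy_204]"

def _sentences(answer: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]

def unsupported_sentences(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Sentences whose citations are absent or point outside the retrieved set."""
    flagged = []
    for sentence in _sentences(answer):
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged

def groundedness_rate(answer: str, retrieved_ids: set[str]) -> float:
    """Share of sentences fully supported by retrieved citations (0.0-1.0)."""
    sents = _sentences(answer)
    if not sents:
        return 0.0
    return 1.0 - len(unsupported_sentences(answer, retrieved_ids)) / len(sents)
```

Injection detection can reuse the same plumbing: include adversarial questions whose retrieved chunks carry canary instructions, and count answers that act on them toward the <1% budget.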
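For the diagnostic-harness task, a minimal skeleton of the ablation loop: vary exactly one axis (prompt, retrieval, model) at a time against a fixed baseline, recording latency and cost per cell so the experiment plan stays inside the constraints above. `Config`, `CellResult`, and `evaluate` are illustrative names; the real harness would plug in FinSure's stack and golden set:

```python
# Sketch: one-variable-at-a-time ablations with structured per-axis
# signals. `evaluate` is assumed to run one configuration end-to-end
# over the golden set and return aggregate metrics.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Config:
    prompt: str       # e.g. "baseline_v3", "strict_citation_v1"
    retriever: str    # e.g. "bm25", "dense", "fused"
    model: str        # e.g. "small", "mid", "premium"

@dataclass(frozen=True)
class CellResult:
    config: Config
    accuracy: float            # vs. human-rated answers
    grounded_rate: float       # share of fully supported answers
    p95_latency_ms: float
    cost_per_request_usd: float

def run_ablations(
    baseline: Config,
    variants: dict[str, list[str]],           # axis -> alternative values
    evaluate: Callable[[Config], CellResult],
) -> dict[str, list[CellResult]]:
    """Hold everything at baseline except one axis; return per-axis results."""
    results: dict[str, list[CellResult]] = {}
    for axis, values in variants.items():
        results[axis] = [
            evaluate(Config(**{**vars(baseline), axis: value}))
            for value in values
        ]
    return results
```

A large quality delta on exactly one axis is the root-cause signal: movement only across retriever variants implicates retrieval, movement only across model tiers implicates model capability, and otherwise the prompt is the prime suspect. The per-cell latency and cost feed directly into the routing decision among the three approved models.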