Monitor RAG Assistant Failure Modes

Scenario

You are responsible for an internal assistant that answers analyst and engineer questions over a large corpus of operational documents, tickets, and runbooks. The system already uses retrieval-augmented generation and is being rolled out to a few hundred daily users, with plans to expand quickly. Early feedback shows the assistant is often helpful, but a small number of answers contain unsupported claims, and security reviewers are concerned about prompt injection hidden inside retrieved content. You need a practical evaluation and monitoring plan before broader launch.

Constraints

p95 end-to-end latency must stay under 2,500 ms
Total inference and retrieval cost must stay under $0.03 per request
Hallucination rate must remain below 2% on a curated evaluation set
Prompt injection success rate must be near 0 on adversarial tests
The system must prefer refusal over unsupported answers
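
One way to make these constraints checkable is to encode them as a small gate that an offline eval run either passes or fails. The sketch below is illustrative only: the names `LaunchGate`, `p95`, and `check_gate` are hypothetical, the scenario does not quantify "near 0" (the value `0.0` is an assumed placeholder), and treating the cost limit as a mean over requests is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LaunchGate:
    """Thresholds taken from the constraints above (names are hypothetical)."""
    max_p95_latency_ms: float = 2500.0
    max_cost_per_request_usd: float = 0.03
    max_hallucination_rate: float = 0.02
    # ASSUMPTION: the scenario only says "near 0"; 0.0 is a placeholder.
    max_injection_success_rate: float = 0.0


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values in a sequence of per-example labels."""
    return sum(flags) / len(flags)


def check_gate(
    gate: LaunchGate,
    latencies_ms: Sequence[float],
    costs_usd: Sequence[float],
    hallucinated: Sequence[bool],
    injection_succeeded: Sequence[bool],
) -> List[str]:
    """Return the names of violated constraints; an empty list means pass."""
    failures = []
    # "Must stay under" is read as strictly less than the limit.
    if p95(latencies_ms) >= gate.max_p95_latency_ms:
        failures.append("p95_latency")
    # ASSUMPTION: cost is judged on the mean cost per request.
    if sum(costs_usd) / len(costs_usd) >= gate.max_cost_per_request_usd:
        failures.append("cost_per_request")
    if rate(hallucinated) >= gate.max_hallucination_rate:
        failures.append("hallucination_rate")
    # With a limit of 0.0, any successful injection fails the gate.
    if rate(injection_succeeded) > gate.max_injection_success_rate:
        failures.append("injection_success_rate")
    return failures
```

The per-example labels (`hallucinated`, `injection_succeeded`) would come from the curated evaluation set and the adversarial tests; how those labels are produced is outside this sketch.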

Available Resources

A production RAG pipeline with document metadata, citations, and request logs
Access to a GPT-4-class model and a smaller low-cost model for secondary checks
2,000 historical user queries and 40 hours/month of expert labeling capacity
A staging environment where you can run shadow traffic and offline evals

Question

How would you evaluate and monitor this LLM system for hallucinations, prompt injection, and other important failure modes while keeping latency and cost within budget, and how would you use those results to decide when the system is safe to expand?
