Context
BrightDesk sells an AI support assistant for B2B SaaS teams. A customer reports that the assistant is "not giving the expected results" on their help-center and policy documents. The complaint is vague: some answers are wrong, some are incomplete, and some seem slow.
Constraints
- p95 latency must stay under 2,500ms
- Cost ceiling: $0.03 per request and $25K/month at current traffic
- Hallucination rate must be below 2% on customer-visible answers
- Prompt injection success rate must be below 0.5%
- The customer requires citations for factual answers and no leakage across tenants
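The latency and cost ceilings above are directly checkable against the production traces before any deeper investigation. A minimal sketch, assuming each trace record carries `latency_ms` and `cost_usd` fields (illustrative names, not the real trace schema):

```python
def p95(values):
    """95th percentile via nearest-rank on a sorted copy."""
    s = sorted(values)
    idx = max(0, int(round(0.95 * len(s))) - 1)
    return s[idx]

def check_ceilings(traces, p95_latency_ms=2500, cost_per_request=0.03):
    """Flag whether a batch of traces violates the stated ceilings.

    `traces` is a list of dicts with assumed keys "latency_ms" and "cost_usd".
    """
    latencies = [t["latency_ms"] for t in traces]
    costs = [t["cost_usd"] for t in traces]
    return {
        "p95_latency_ms": p95(latencies),
        "latency_ok": p95(latencies) <= p95_latency_ms,
        "max_cost_usd": max(costs),
        "cost_ok": all(c <= cost_per_request for c in costs),
    }
```

Running this per segment (per tenant, per query type) rather than globally is usually more informative, since a healthy global p95 can hide one slow segment.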
Available Resources
- 120K customer documents across help articles, PDFs, release notes, and internal policy pages
- Existing RAG stack: chunking, embeddings, hybrid search, reranker, and a GPT-4/Claude-class generation model
- 5K production traces with query, retrieved chunks, model output, latency, cost, and user feedback
- Support tickets tagged as "bad answer," "missing context," or "slow"
- Ability to run offline evals and limited online A/B tests
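Because the tickets are already tagged, the 5K traces can be sliced by complaint type to see where failures cluster before proposing any fix. A sketch, assuming a hypothetical `ticket_tag` field that joins a trace to its support ticket:

```python
from collections import Counter

def segment_by_tag(traces):
    """Count traces per complaint tag ("bad answer", "missing context",
    "slow") to see where failures cluster. Traces without a linked ticket
    fall into "untagged"."""
    return Counter(t.get("ticket_tag", "untagged") for t in traces)
```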
Task
- Propose a step-by-step investigation plan to determine whether the issue is caused by retrieval quality, prompt design, model behavior, document quality, safety failures, or latency/cost tradeoffs.
- Define the offline and online evaluation framework you would use before changing the architecture, including golden sets, segmentation, hallucination measurement, and prompt-injection testing.
- Design the target RAG architecture and prompt changes you would recommend if the root cause is confirmed, including citation behavior and refusal rules.
- Explain how you would quantify and prioritize fixes under the stated latency and cost ceilings.
- Identify the main failure modes you would monitor in production and how you would mitigate them safely.
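As one concrete starting point for the hallucination measurement the framework must define: a token-overlap groundedness proxy over a golden set, which flags answer sentences poorly supported by the retrieved chunks. This is a crude sketch, not a substitute for human or LLM-judge grading; the function name and the 0.5 threshold are assumptions:

```python
def grounded_fraction(answer_sentences, retrieved_text, threshold=0.5):
    """Fraction of answer sentences whose tokens mostly appear in the
    retrieved text. A low value suggests possible hallucination; real
    grading still needs a human or LLM judge to confirm."""
    retrieved_tokens = set(retrieved_text.lower().split())

    def supported(sentence):
        tokens = sentence.lower().split()
        if not tokens:
            return True
        overlap = sum(tok in retrieved_tokens for tok in tokens) / len(tokens)
        return overlap >= threshold

    flags = [supported(s) for s in answer_sentences]
    return sum(flags) / len(flags)
```

Tracked per segment alongside the p95 latency and cost numbers, a metric like this gives the offline eval a cheap first-pass signal against the 2% hallucination ceiling before the more expensive judged evaluation runs.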