Context
BrightDesk sells an AI support assistant for B2B SaaS teams. A customer reports that the assistant is "not giving the expected results" on their help-center and policy documents. The complaint is vague: some answers are wrong, some are incomplete, and some seem slow.
Constraints
- p95 latency must stay under 2,500ms
- Cost ceiling: $0.03 per request and $25K/month at current traffic
- Hallucination rate must be below 2% on customer-visible answers
- Prompt injection success rate must be below 0.5%
- The customer requires citations for factual answers and no leakage across tenants
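The latency and cost ceilings above are directly checkable against the production traces before any deeper investigation. A minimal sketch, assuming each trace record carries `latency_ms` and `cost_usd` fields (illustrative names, not the real trace schema):

```python
def p95(values):
    """95th percentile via nearest-rank on a sorted copy."""
    s = sorted(values)
    idx = max(0, int(round(0.95 * len(s))) - 1)
    return s[idx]

def check_ceilings(traces, p95_latency_ms=2500, cost_per_request=0.03):
    """Flag whether a batch of traces violates the stated ceilings.

    `traces` is a list of dicts with assumed keys "latency_ms" and "cost_usd".
    """
    latencies = [t["latency_ms"] for t in traces]
    costs = [t["cost_usd"] for t in traces]
    return {
        "p95_latency_ms": p95(latencies),
        "latency_ok": p95(latencies) <= p95_latency_ms,
        "max_cost_usd": max(costs),
        "cost_ok": all(c <= cost_per_request for c in costs),
    }
```

Running this per segment (per tenant, per query type) rather than globally is usually more informative, since a healthy global p95 can hide one slow segment.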
Available Resources
- 120K customer documents across help articles, PDFs, release notes, and internal policy pages
- Existing RAG stack: chunking, embeddings, hybrid search, reranker, and a GPT-4/Claude-class generation model
- 5K production traces with query, retrieved chunks, model output, latency, cost, and user feedback
- Support tickets tagged as "bad answer," "missing context," or "slow"
- Ability to run offline evals and limited online A/B tests
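Because the tickets are already tagged, the 5K traces can be sliced by complaint type to see where failures cluster before proposing any fix. A sketch, assuming a hypothetical `ticket_tag` field that joins a trace to its support ticket:

```python
from collections import Counter

def segment_by_tag(traces):
    """Count traces per complaint tag ("bad answer", "missing context",
    "slow") to see where failures cluster. Traces without a linked ticket
    fall into "untagged"."""
    return Counter(t.get("ticket_tag", "untagged") for t in traces)
```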
Task
- Propose a step-by-step investigation plan to determine whether the issue is caused by retrieval quality, prompt design, model behavior, document quality, safety failures, or latency/cost tradeoffs.
- Define the offline and online evaluation framework you would use before changing the architecture, including golden sets, segmentation, hallucination measurement, and prompt-injection testing.
- Design the target RAG architecture and prompt changes you would recommend if the root cause is confirmed, including citation behavior and refusal rules.
- Explain how you would quantify and prioritize fixes under the stated latency and cost ceilings.
- Identify the main failure modes you would monitor in production and how you would mitigate them safely.
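As one concrete starting point for the hallucination measurement the framework must define: a token-overlap groundedness proxy over a golden set, which flags answer sentences poorly supported by the retrieved chunks. This is a crude sketch, not a substitute for human or LLM-judge grading; the function name and the 0.5 threshold are assumptions:

```python
def grounded_fraction(answer_sentences, retrieved_text, threshold=0.5):
    """Fraction of answer sentences whose tokens mostly appear in the
    retrieved text. A low value suggests possible hallucination; real
    grading still needs a human or LLM judge to confirm."""
    retrieved_tokens = set(retrieved_text.lower().split())

    def supported(sentence):
        tokens = sentence.lower().split()
        if not tokens:
            return True
        overlap = sum(tok in retrieved_tokens for tok in tokens) / len(tokens)
        return overlap >= threshold

    flags = [supported(s) for s in answer_sentences]
    return sum(flags) / len(flags)
```

Tracked per segment alongside the p95 latency and cost numbers, a metric like this gives the offline eval a cheap first-pass signal against the 2% hallucination ceiling before the more expensive judged evaluation runs.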