Context
BrightDesk has an LLM-powered customer-support copilot that drafts answers for human agents using the company help center, policy docs, and recent ticket history. A large enterprise customer says the product is "not performing as expected," but the complaint is vague: agents report low trust, inconsistent answers, and occasional policy mistakes.
Constraints
- p95 latency: 2,500 ms per draft
- Cost ceiling: $12K/month at 400K draft generations per month (about $0.03 per draft)
- Accuracy bar: at least 85% acceptable drafts on a labeled eval set
- Hallucination ceiling: <2% unsupported policy claims
- Safety: must resist prompt injection from ticket text and retrieved docs; must not leak PII across tickets
- The team needs a diagnosis plan within 1 week, not a full rebuild
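The cost ceiling above works out to a per-draft budget, which is the figure any proposed change has to fit inside. A minimal sketch of that arithmetic follows. The token counts and per-1K-token prices in the usage lines are hypothetical placeholders, not figures from this brief; real values would come from the token counts in the system logs and the provider's price list.

```python
# Per-draft budget implied by the stated cost ceiling.
MONTHLY_BUDGET_USD = 12_000   # from the Constraints list
MONTHLY_DRAFTS = 400_000      # from the Constraints list


def per_draft_budget(monthly_budget_usd: float, monthly_drafts: int) -> float:
    """Maximum average spend per draft that stays within the monthly ceiling."""
    return monthly_budget_usd / monthly_drafts


def draft_cost(input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Estimated model cost of one draft from token counts and unit prices."""
    return (input_tokens / 1000) * usd_per_1k_input \
        + (output_tokens / 1000) * usd_per_1k_output


budget = per_draft_budget(MONTHLY_BUDGET_USD, MONTHLY_DRAFTS)

# Hypothetical example: 6,000 prompt tokens, 400 completion tokens,
# and made-up prices. Replace with logged token counts and real prices.
example_cost = draft_cost(6_000, 400, usd_per_1k_input=0.003, usd_per_1k_output=0.015)
within_budget = example_cost <= budget
```

Embedding and search costs are left out of this sketch; if they are billed per call, they would be added to `draft_cost` before comparing against the budget.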
Available Resources
- 50K historical support tickets with agent final replies and resolution codes
- 8K help-center and policy documents, versioned by date
- Current system logs: user query, retrieved docs, model output, latency, token counts, thumbs up/down
- One approved hosted LLM, one smaller cheaper model, embeddings API, and a hybrid search index
- 20 support QA reviewers available to label a golden set of 300 examples
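A 300-example golden set limits how precisely the 85% accuracy bar and the 2% hallucination ceiling can be verified. The sketch below uses a Wilson score interval at roughly 95% confidence to show the uncertainty around an observed rate. The choice of the Wilson interval and the example counts are assumptions for illustration, not part of this brief.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


GOLDEN_SET_SIZE = 300  # from the Available Resources list

# Hypothetical outcome: 255 of 300 drafts judged acceptable (85%).
acc_low, acc_high = wilson_interval(255, GOLDEN_SET_SIZE)

# Hypothetical outcome: 3 of 300 drafts contain an unsupported policy claim (1%).
hal_low, hal_high = wilson_interval(3, GOLDEN_SET_SIZE)

# True only if the whole interval sits below the 2% ceiling.
ceiling_confirmed = hal_high < 0.02
```

If the interval for the hallucination rate straddles 2%, the golden set alone cannot confirm the ceiling, and a larger or policy-focused sample would be one way to tighten it.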
Task
- Describe what you would investigate first when a customer says the copilot is underperforming, and how you would separate retrieval, prompt, model, safety, and UX issues.
- Define an eval-first diagnosis plan: offline datasets, rubrics, segmentation, and online metrics you would inspect before changing architecture.
- Propose the minimum set of prompt, retrieval, and guardrail changes you would test first under the latency and cost limits.
- Explain how you would detect hallucinations, prompt injection, stale-policy answers, and permission/PII leakage.
- Estimate the likely cost/latency impact of your proposed changes and how you would decide whether to ship, rollback, or escalate to a larger redesign.
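The numeric limits from the Constraints list can be encoded as a single check, so that every candidate change is compared against the same thresholds. This is a sketch under assumptions: the `Metrics` type and `failed_constraints` function are hypothetical names, and the function only reports which limits are violated. It does not decide between ship, rollback, or escalation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    """Measured values for one candidate configuration (hypothetical shape)."""
    p95_latency_ms: float
    monthly_cost_usd: float
    acceptable_rate: float                 # fraction of acceptable drafts on the eval set
    unsupported_policy_claim_rate: float   # fraction of drafts with unsupported policy claims


def failed_constraints(m: Metrics) -> List[str]:
    """Return the names of the stated constraints that the metrics violate."""
    failures = []
    if m.p95_latency_ms > 2500:                  # p95 latency: 2,500 ms per draft
        failures.append("p95_latency")
    if m.monthly_cost_usd > 12_000:              # cost ceiling: $12K/month
        failures.append("cost")
    if m.acceptable_rate < 0.85:                 # accuracy bar: at least 85%
        failures.append("accuracy")
    if m.unsupported_policy_claim_rate >= 0.02:  # hallucination ceiling: <2%
        failures.append("hallucination")
    return failures


# Hypothetical measurements for two candidate configurations.
passing = failed_constraints(Metrics(2400, 11_000, 0.86, 0.01))
failing = failed_constraints(Metrics(2600, 11_000, 0.80, 0.02))
```

The safety requirements (prompt-injection resistance and PII isolation) are not numeric in this brief, so they are left out of the check and would need their own pass/fail tests.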