Diagnose Underperforming Support Copilot

Context

BrightDesk has an LLM-powered customer-support copilot that drafts answers for human agents using the company help center, policy docs, and recent ticket history. A large enterprise customer says the product is "not performing as expected," but the complaint is vague: agents report low trust, inconsistent answers, and occasional policy mistakes.

Constraints

p95 latency: 2,500 ms per draft
Cost ceiling: $12K/month at 400K draft generations
Accuracy bar: at least 85% acceptable drafts on a labeled eval set
Hallucination ceiling: <2% unsupported policy claims
Safety: must resist prompt injection from ticket text and retrieved docs; must not leak PII across tickets
The team needs a diagnosis plan within 1 week, not a full rebuild
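
The cost ceiling implies a per-draft budget that any proposed change has to fit inside. The sketch below only restates the arithmetic from the constraints; the function names are illustrative, and the example per-draft costs in the usage note are hypothetical inputs, not measured values.

```python
# Figures taken from the Constraints section.
MONTHLY_COST_CEILING_USD = 12_000
MONTHLY_DRAFTS = 400_000


def per_draft_budget_usd(ceiling: float = MONTHLY_COST_CEILING_USD,
                         drafts: int = MONTHLY_DRAFTS) -> float:
    """Average spend allowed per draft under the monthly ceiling."""
    return ceiling / drafts


def monthly_cost_usd(cost_per_draft: float,
                     drafts: int = MONTHLY_DRAFTS) -> float:
    """Projected monthly spend for a given average cost per draft."""
    return cost_per_draft * drafts


def within_budget(cost_per_draft: float) -> bool:
    """True if the projected monthly spend stays at or under the ceiling."""
    return monthly_cost_usd(cost_per_draft) <= MONTHLY_COST_CEILING_USD


if __name__ == "__main__":
    print(f"per-draft budget: ${per_draft_budget_usd():.3f}")
    # Hypothetical per-draft costs, for illustration only.
    for candidate in (0.025, 0.031):
        print(candidate, within_budget(candidate))
```

The per-draft budget works out to $0.03, so an extra model call per draft (for example a verifier pass) has to be paid for out of that same three cents.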

Available Resources

50K historical support tickets with agent final replies and resolution codes
8K help-center and policy documents, versioned by date
Current system logs: user query, retrieved docs, model output, latency, token counts, thumbs up/down
One approved hosted LLM, one smaller cheaper model, embeddings API, and a hybrid search index
20 support QA reviewers available to label a golden set of 300 examples
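
A golden set of 300 examples split across 20 reviewers is 15 examples per reviewer, and a set that size carries real sampling uncertainty against the 85% and 2% thresholds. The sketch below uses a Wilson score interval, which is one common choice for a binomial proportion and is an assumption here, not something the brief prescribes; `wilson_interval` is an illustrative helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # 255 of 300 acceptable drafts is exactly the 85% accuracy bar.
    print("accuracy:", wilson_interval(255, 300))
    # 6 of 300 drafts with an unsupported claim is exactly 2%.
    print("hallucination:", wilson_interval(6, 300))
```

With 300 examples the interval around an observed 85% spans roughly four percentage points either side, and an observed 2% hallucination rate cannot be distinguished from a rate above the ceiling, so the golden set is better suited to spotting large gaps than to certifying the 2% limit.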

Task

Describe what you would investigate first when a customer says the copilot is underperforming, and how you would separate retrieval, prompt, model, safety, and UX issues.
Define an eval-first diagnosis plan: offline datasets, rubrics, segmentation, and online metrics you would inspect before changing architecture.
Propose the minimum set of prompt, retrieval, and guardrail changes you would test first under the latency and cost limits.
Explain how you would detect hallucinations, prompt injection, stale-policy answers, and permission/PII leakage.
Estimate the likely cost/latency impact of your proposed changes and how you would decide whether to ship, roll back, or escalate to a larger redesign.
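
As one illustration of what "separating" retrieval from generation issues could look like on labeled examples, the sketch below assigns each unacceptable draft to a coarse bucket. It is a minimal sketch under stated assumptions, not a prescribed answer: the `LabeledExample` fields, the bucket names, and the idea that reviewers mark gold and stale document IDs are all hypothetical and do not come from the brief or the existing logs.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class LabeledExample:
    """One golden-set row; every field here is a hypothetical label."""
    retrieved_doc_ids: List[str]
    gold_doc_ids: Set[str]            # docs a reviewer says contain the answer
    draft_acceptable: bool            # reviewer verdict on the draft
    stale_doc_ids: Set[str] = field(default_factory=set)  # superseded versions


def triage(ex: LabeledExample) -> str:
    """Assign one coarse failure bucket to an example."""
    if ex.draft_acceptable:
        return "ok"
    retrieved = set(ex.retrieved_doc_ids)
    if not retrieved & ex.gold_doc_ids:
        # The right document never reached the model.
        return "retrieval_miss"
    if retrieved & ex.stale_doc_ids:
        # A superseded policy version was in context.
        return "stale_policy"
    # The right context was present, so suspect the prompt or the model.
    return "generation"


def summarize(examples: Iterable[LabeledExample]) -> Counter:
    """Count examples per bucket."""
    return Counter(triage(ex) for ex in examples)


if __name__ == "__main__":
    sample = [
        LabeledExample(["d1"], {"d1"}, True),
        LabeledExample(["d2"], {"d9"}, False),
        LabeledExample(["d3", "d3_old"], {"d3"}, False, {"d3_old"}),
        LabeledExample(["d4"], {"d4"}, False),
    ]
    print(summarize(sample))
```

The buckets are deliberately coarse: safety and UX issues would need their own labels, and a single example can have more than one cause even though this sketch returns only the first match.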
