Context
BrightDesk is a B2B customer-support platform testing an LLM-powered support workflow for small-business customers. The workflow drafts replies, retrieves help-center content, and suggests next actions to agents; leadership wants to know whether it actually helps customers succeed rather than just reducing handle time.
Constraints
- p95 end-to-end latency: 2,500ms per assistant turn
- Cost ceiling: $0.035 per assisted conversation turn, or $45K/month at 1.4M turns
- Hallucination ceiling: <2% of responses on a labeled evaluation set may contain unsupported policy or product claims
- Safety: must resist prompt injection from customer messages or retrieved docs, avoid leaking PII, and refuse when evidence is insufficient
- Business goal: improve customer task success without increasing reopen rate, escalations, or compliance incidents
Available Resources
- 18 months of support conversations with outcomes: resolved/not resolved, reopen within 7 days, CSAT, refund issued, escalation, and retention at 30 days
- Product help-center articles, internal support macros, policy docs, and troubleshooting guides
- Event logs for agent actions and customer follow-up behavior
- A baseline workflow without LLM assistance and a current pilot using retrieval + generation
- Access to GPT-4.1-mini or Claude Sonnet-class models, embeddings, and a hybrid search index
Task
- Define an evaluation framework that determines whether the LLM workflow improves customer success, not just agent efficiency. Specify primary metrics, guardrails, and how you would segment results by issue type and customer tier.
- Design the offline evaluation suite first: golden set construction, hallucination and faithfulness checks, prompt-injection tests, and how you would calibrate any LLM-as-judge rubric against human labels.
- Propose the online evaluation plan: experiment design, success criteria, unit of randomization, and how to handle confounders such as agent learning effects and issue-mix shifts.
- Outline the workflow architecture and prompt strategy needed to support the evaluation goals, including grounded answering, citation requirements, and refusal behavior.
- Estimate cost and latency, then explain what you would simplify or change if the system misses either budget while preserving customer-success gains.