Context
BrightDesk runs an AI assistant that drafts replies for customer-support agents using help-center articles, policy docs, and recent ticket history. Leadership wants to reduce LLM spend materially because usage has grown faster than expected, but support quality cannot regress.
Constraints
- Current volume: 1.2M draft generations/month
- Current average cost: $0.018 per request
- Target: reduce total LLM cost by at least 50%
- p95 latency must stay under 1,800ms
- Hallucination rate must remain below 2.5% on a labeled support set
- Prompt-injection success rate must be below 1% on adversarial tests
- Escalation-to-human rate cannot worsen by more than 1 percentage point after launch
Available Resources
- 80K historical support tickets with agent-written final responses
- 6K help-center and policy documents, already permission-scoped
- Existing baseline: a single large model call with a long prompt containing ticket text, account metadata, and top-10 retrieved docs
- Access to one large model, one mid-tier model, and one small model from an approved provider
- Existing evaluation assets: 800 labeled prompts with rubric scores for correctness, groundedness, tone, and policy compliance
Task
- Propose a cost-reduction plan for the drafting system. Be specific about what changed: prompt compression, retrieval changes, model routing, caching, structured outputs, or a smaller model for some requests.
- Define an evaluation-first rollout plan. Specify offline and online metrics, acceptance thresholds, and how you would prove quality was not harmed.
- Design the prompt and serving architecture, including how you keep hallucination and prompt injection risk low while shrinking token usage.
- Estimate cost and latency before vs. after your changes, including assumptions about token counts, routing percentages, and monthly volume.
- Identify the main failure modes of your optimization plan and how you would detect and mitigate them in production.