Context
BrightDesk is building an AI copilot for customer support agents. The feature should draft grounded answers to customer questions using help-center content, policy docs, and historical resolved tickets, while adapting to BrightDesk's tone and workflow.
Constraints
- p95 latency: ≤1,500 ms for agent-facing suggestions
- Cost ceiling: $12K/month at 300K requests/month
- Hallucination ceiling: <2% on policy-related questions
- Must not reveal internal-only notes or customer PII
- Prompt injection success rate from retrieved content should be near 0% on adversarial tests
- The engineering team can support only one production system over the next 8 weeks; avoid unnecessary complexity
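The cost and volume numbers above pin down a hard per-request spend. A quick back-of-envelope sketch; the per-token prices below are illustrative assumptions, not published rates:

```python
# Cost ceiling check: $12K/month at 300K requests/month leaves $0.04 per
# request end to end (generation, embeddings, reranking, retries).

MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 300_000

per_request_budget = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS  # 0.04

# Hypothetical per-million-token prices for a GPT-4.1-mini-class model.
# These are placeholders; check current provider pricing.
PRICE_IN_PER_MTOK = 0.40
PRICE_OUT_PER_MTOK = 1.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated model cost for one request, in USD."""
    return (input_tokens * PRICE_IN_PER_MTOK
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# Example: 4K prompt tokens (instructions + retrieved chunks) and a
# 400-token drafted reply.
cost = request_cost(4_000, 400)
print(f"per-request budget: ${per_request_budget:.3f}")
print(f"estimated model cost: ${cost:.4f}")
```

Under these assumed prices a 4K-in / 400-out request costs well under the $0.04 budget, which is the kind of headroom a proposal should make explicit before adding rerankers or larger models.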
Available Resources
- 40K public help-center articles and policy pages
- 2M historical support tickets with resolution labels, of which only 300K are high quality
- Existing search stack supports BM25 and vector search
- Access to GPT-4.1-mini / GPT-4.1 class models and embedding APIs
- 1,000 manually labeled evaluation examples across billing, refunds, outages, and account access
- Human reviewers from support operations can label another 200 examples if needed
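Since the existing stack already exposes both BM25 and vector search, a natural first retrieval step is to fuse the two rankings. A minimal reciprocal-rank-fusion (RRF) sketch; the document IDs are made up for illustration, and in production the two ranked lists would come from the real BM25 and vector indexes:

```python
# Hybrid retrieval via reciprocal rank fusion: each ranked list
# contributes 1 / (k + rank + 1) to a document's fused score.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; higher fused score ranks earlier."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["policy-refunds", "kb-billing-101", "ticket-88321"]
vector_hits = ["policy-refunds", "kb-outage-status", "kb-billing-101"]
fused = rrf_merge([bm25_hits, vector_hits])
print(fused)  # "policy-refunds" ranks first: top hit in both lists
```

RRF sidesteps score calibration between BM25 and embedding similarity because it uses only ranks; k = 60 is the damping constant commonly used as a default.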
Task
- Propose a decision framework for when BrightDesk should use prompt engineering only, RAG, or fine-tuning, and explain the trade-offs across quality, latency, cost, maintainability, and safety.
- Design an initial production approach for the first launch, including prompt strategy, retrieval or training plan, and fallback/refusal behavior.
- Define an evaluation-first plan: offline benchmarks before launch and online metrics after launch. Be explicit about how you will measure hallucination, groundedness, style adherence, and prompt-injection robustness.
- Estimate cost and latency for your recommended approach, and explain what you would cut or change if the system misses either budget.
- Identify the top failure modes, including cases where retrieval hurts quality or fine-tuning creates stale behavior, and propose mitigations.
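One concrete piece the evaluation plan must define: the <2% hallucination ceiling implies a per-category rate computed over judged examples. A minimal sketch, assuming each labeled example has already received a grounded/not-grounded verdict from a human reviewer or an LLM judge (the judging step itself is not shown):

```python
# Per-category hallucination rate from (category, grounded) verdicts.
# A draft counts toward the hallucination rate when it is not grounded.

def hallucination_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each category to its fraction of ungrounded drafts."""
    totals: dict[str, list[int]] = {}
    for category, grounded in results:
        bad, n = totals.get(category, [0, 0])
        totals[category] = [bad + (0 if grounded else 1), n + 1]
    return {c: bad / n for c, (bad, n) in totals.items()}

# Toy verdicts over the eval categories; real runs would cover the
# 1,000 labeled examples.
results = [
    ("billing", True), ("billing", False),
    ("refunds", True), ("refunds", True),
    ("outages", True),
]
rates = hallucination_rate(results)
print(rates)  # {'billing': 0.5, 'refunds': 0.0, 'outages': 0.0}
```

Reporting per category rather than a single aggregate matters here because the 2% ceiling applies specifically to policy-related questions, which a blended average could mask.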