Context
FinPilot is adding an LLM-powered operations copilot that helps support agents resolve billing disputes. The assistant reads CRM notes, policy docs, and transaction history, then recommends the next action or drafts a customer response.
Constraints
- p95 end-to-end latency: ≤ 2,500 ms
- Cost ceiling: $0.035 per request and $25K/month at 30K requests/day (note: at 30K requests/day the monthly cap is the tighter bound, so the effective per-request budget is below $0.035)
- Hallucination ceiling: <2% on policy-grounded recommendations
- Unsafe action rate: 0% for irreversible actions (refund approval, account closure)
- Must resist prompt injection in CRM notes and customer-provided text
- Must not leak PII or hidden system instructions
Available Resources
- 80K internal policy and workflow documents, updated daily
- CRM tickets with agent notes, customer messages, and structured account metadata
- Tools: policy search, transaction lookup, refund eligibility checker, ticket escalation API
- Approved models: a small fast model for classification/routing and a stronger model for final reasoning
- 2,000 historical tickets with human-reviewed resolutions; 150 tickets can be labeled deeply for a golden set
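The two-tier model setup above is usually wired as a cascade: the small model classifies and routes, and only ambiguous or high-stakes tickets reach the stronger model or a human. A minimal sketch of that routing logic (all names, thresholds, and intent labels here are hypothetical, not part of FinPilot's stack):

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    route: str   # "small_model", "large_model", or "human_escalation"
    reason: str

# Hypothetical intent labels; the irreversible set mirrors the constraint
# that refund approval and account closure must never be auto-executed.
IRREVERSIBLE_INTENTS = {"refund_approval", "account_closure"}

def route_ticket(intent: str, confidence: float) -> RoutingDecision:
    """Route a ticket given the small classifier's intent label and confidence."""
    if intent in IRREVERSIBLE_INTENTS:
        # 0% unsafe-action constraint: irreversible actions always get a human.
        return RoutingDecision("human_escalation", "irreversible action")
    if confidence < 0.7:  # assumed threshold; tune against the golden set
        return RoutingDecision("large_model", "low classifier confidence")
    return RoutingDecision("small_model", "high-confidence routine intent")
```

For example, `route_ticket("refund_approval", 0.99)` escalates to a human regardless of confidence, while a low-confidence billing question falls through to the stronger model.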
Task
This is not a request to merely list generic risks. Design how you would reason about failure modes when integrating this LLM into a real product workflow.
- Define the most important failure modes across retrieval, prompting, tool use, grounding, safety, and user interaction. Prioritize them by business impact and likelihood.
- Propose an eval-first plan: offline evals before launch and online monitoring after launch. Include how you would measure hallucinations, prompt injection success, unsafe tool calls, and silent quality regressions.
- Describe the architecture and control points you would add to prevent or contain failures, including when the system should refuse, ask for clarification, or escalate to a human.
- Estimate cost and latency for your design, including where you would use smaller vs larger models.
- Explain rollout strategy, guardrails, and rollback criteria if failure rates exceed thresholds.