Context
FinPilot is building an AI assistant that answers customer questions about credit-card benefits, fees, disputes, and rewards using internal policy documents and public help-center content. Leadership wants a clear ship/no-ship decision for a beta launch to 5% of customers.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $0.03 per answered request (≈$45K/month at 50K requests/day)
- Hallucination ceiling: <2% on a labeled golden set for policy-related questions
- Unsafe answer rate: <0.5% for regulated or high-risk financial guidance
- Prompt-injection success rate: effectively 0% (no reproducible successes) on adversarial tests
- The feature must refuse or escalate when evidence is missing, conflicting, or out of policy
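The per-request and monthly cost ceilings above are mutually consistent; a quick sketch of the arithmetic (assuming a 30-day month):

```python
# Sanity-check of the cost ceiling stated in the constraints.
COST_PER_REQUEST = 0.03   # dollars per answered request
REQUESTS_PER_DAY = 50_000
DAYS_PER_MONTH = 30       # assumption: flat 30-day month

monthly_spend = COST_PER_REQUEST * REQUESTS_PER_DAY * DAYS_PER_MONTH
print(f"${monthly_spend:,.0f}/month")  # $45,000/month, matching the stated ceiling
```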
Available Resources
- 120K internal documents: policy manuals, dispute SOPs, compliance FAQs, release notes
- 15K public help-center articles and product pages
- Query logs from a legacy search experience, including follow-up contacts and CSAT scores
- 2 approved model tiers: a cheaper fast model and a higher-quality model
- Existing hybrid retrieval stack (BM25 + vector search), document metadata, and user-permission filters
- 40 operations specialists available to label a 1,000-question evaluation set
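It is worth noting what a 1,000-question golden set can actually resolve against the 2% hallucination ceiling. A rough sizing sketch using the normal approximation to the binomial (the numbers are illustrative assumptions, not results):

```python
import math

n = 1_000   # labeled golden-set size from the resources above
p = 0.02    # hallucination ceiling from the constraints
z = 1.96    # two-sided 95% confidence

# Normal-approximation margin of error for an observed rate near p.
margin = z * math.sqrt(p * (1 - p) / n)
print(f"95% margin at p=2%, n=1,000: ±{margin:.2%}")  # roughly ±0.9 percentage points
```

In other words, a measured rate near the 2% bar carries close to a full percentage point of uncertainty at this sample size, which matters when the ceiling itself is 2%.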
Task
- Define what “ready to ship” means for this AI feature, including offline thresholds, online guardrails, and explicit launch blockers.
- Design an evaluation plan that measures answer quality, hallucination, refusal quality, retrieval quality, safety, and prompt-injection robustness before launch.
- Propose the minimum architecture and prompting approach needed to meet the bar, including when the system should answer, refuse, or escalate to a human.
- Estimate expected cost and latency at target volume, and explain what tradeoffs you would make if quality misses target or spend exceeds budget.
- Identify the top failure modes you would monitor after launch and the rollback criteria for pausing or disabling the feature.
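One way to make the answer/refuse/escalate behavior asked for above concrete is a small decision gate over retrieval signals. Everything below (the field names, the 0.55 evidence threshold) is a hypothetical sketch under assumed signals, not a prescribed design; real thresholds would be tuned on the golden set:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    REFUSE = "refuse"
    ESCALATE = "escalate"

@dataclass
class RetrievalResult:
    top_score: float    # best retrieval relevance score (hypothetical signal)
    conflicting: bool   # retrieved documents disagree with each other
    in_policy: bool     # topic is within answerable policy scope
    high_risk: bool     # regulated / high-risk financial guidance

MIN_EVIDENCE_SCORE = 0.55  # hypothetical threshold, tuned offline

def decide(r: RetrievalResult) -> Action:
    if not r.in_policy:
        return Action.REFUSE     # out of policy: refuse outright
    if r.high_risk or r.conflicting:
        return Action.ESCALATE   # regulated or conflicting evidence: hand to a human
    if r.top_score < MIN_EVIDENCE_SCORE:
        return Action.REFUSE     # missing evidence: refuse rather than guess
    return Action.ANSWER
```

The gate mirrors the constraint that the feature must refuse or escalate when evidence is missing, conflicting, or out of policy, with escalation reserved for cases where a human can add value.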