Context
FinFlow, a mid-market fintech, runs a customer-facing AI assistant that answers account, billing, and product-policy questions inside its web app. The assistant is helpful but occasionally fabricates policy details or invents unsupported troubleshooting steps, creating compliance and trust risk.
Constraints
- p95 latency must stay under 2,500ms end-to-end
- Cost ceiling: $0.035 per request and $45K/month total at ~1.2M requests/month (the per-request cap implies ~$42K/month at that volume; see the arithmetic sketch after this list)
- Hallucination rate must be below 1.5% on a labeled customer-support golden set
- For questions not supported by approved sources, the assistant must refuse or escalate rather than guess
- Must resist prompt injection from user input and retrieved documents
- Responses must not expose PII or internal-only policy text
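For concreteness, the two cost caps are mutually consistent at the stated volume. A quick check, using only figures from the constraints above:

```python
# Sanity-check the cost constraints: per-request cap vs. monthly cap.
PER_REQUEST_CAP = 0.035        # USD per request (stated above)
MONTHLY_CAP = 45_000           # USD per month (stated above)
MONTHLY_REQUESTS = 1_200_000   # requests per month (stated above)

implied_monthly = PER_REQUEST_CAP * MONTHLY_REQUESTS   # $42,000
headroom = MONTHLY_CAP - implied_monthly               # $3,000

print(f"Per-request cap implies ${implied_monthly:,.0f}/month, "
      f"leaving ${headroom:,.0f} under the ${MONTHLY_CAP:,} monthly cap.")
```

The per-request cap is the binding constraint at 1.2M requests/month; the monthly cap only binds if volume grows past roughly 1.29M requests.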
Available Resources
- 80K approved support articles, help-center pages, policy documents, and troubleshooting runbooks
- 18 months of historical support chats with resolution labels and escalation outcomes
- Product catalog metadata, account-state APIs, and a ticket-escalation tool
- Access to a production-approved LLM, an embedding model, and a hybrid search index
- 2,000 human-labeled evaluation examples, including unanswerable and adversarial prompts (an illustrative record shape is sketched below)
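To make the golden set concrete, one plausible record shape covering answerable, unanswerable, and adversarial slices might look like the following. The field names and slice taxonomy are assumptions for illustration, not an existing FinFlow schema:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Illustrative golden-set record. All field names and the slice
# taxonomy are assumptions, not an existing FinFlow schema.
@dataclass
class GoldenExample:
    example_id: str
    user_message: str
    slice: Literal["answerable", "unanswerable", "adversarial"]
    expected_behavior: Literal["answer", "refuse", "escalate"]
    reference_answer: Optional[str]   # None for refuse/escalate cases
    supporting_doc_ids: list[str]     # approved sources a correct answer cites
```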
Task
Design a workflow to reduce hallucinations in this assistant while preserving user experience.
- Propose an evaluation-first plan: define offline and online metrics, golden-set slices, and launch gates before describing the architecture (see the launch-gate sketch after this list).
- Design the end-to-end workflow, including prompt strategy, retrieval, grounding, refusal behavior, and when to call tools or escalate to a human (see the grounded-answer sketch after this list).
- Explain how you would defend against hallucinated claims, unsupported citations, prompt injection, stale documents, and account-specific mistakes (see the citation-check sketch after this list).
- Estimate cost and latency for your design, and describe what you would change if you were over either budget.
- Outline how you would monitor regressions after launch and safely iterate on prompts, retrieval, or models (see the regression-monitor sketch after this list).
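For the evaluation-first plan, a minimal sketch of an offline launch gate over the golden set follows. The 1.5% hallucination ceiling comes from the constraints; the other thresholds and the result-record fields are illustrative assumptions:

```python
# Offline launch gate over golden-set results. Each result dict is assumed
# to carry: "slice" (answerable/unanswerable/adversarial) plus boolean
# "hallucinated", "refused_or_escalated", and "correct" flags.

def slice_rate(results: list[dict], slice_name: str, flag: str) -> float:
    rows = [r for r in results if r["slice"] == slice_name]
    return sum(r[flag] for r in rows) / len(rows) if rows else 0.0

def passes_launch_gate(results: list[dict]) -> bool:
    hallucination_rate = sum(r["hallucinated"] for r in results) / len(results)
    return (
        hallucination_rate < 0.015  # hard constraint from the brief
        and slice_rate(results, "unanswerable", "refused_or_escalated") >= 0.98  # assumed gate
        and slice_rate(results, "adversarial", "refused_or_escalated") >= 0.98   # assumed gate
        and slice_rate(results, "answerable", "correct") >= 0.90                 # assumed gate
    )
```

Gating on per-slice rates rather than a single aggregate prevents a model that over-refuses (acing the unanswerable slice) from hiding accuracy regressions on answerable questions.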
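For the end-to-end workflow, here is a sketch of the core answer path: retrieve from approved sources, instruct the model to answer only from the retrieved excerpts, and refuse or escalate otherwise. `hybrid_search`, `llm_complete`, and `escalate_ticket` are stand-ins for FinFlow's actual search index, production-approved LLM, and escalation tool, and the prompt wording is illustrative:

```python
# Grounded-answer path with explicit refusal. The callables are stand-ins
# injected by the caller; nothing here is FinFlow's actual API.
GROUNDED_PROMPT = """Answer ONLY from the numbered excerpts below, and cite
excerpt numbers like [1] for every claim. If the excerpts do not answer the
question, reply exactly: CANNOT_ANSWER.

Excerpts:
{excerpts}

Question: {question}"""

def answer(question: str, hybrid_search, llm_complete, escalate_ticket) -> str:
    docs = hybrid_search(question, top_k=6)   # approved sources only
    if not docs:
        return escalate_ticket(question, reason="no supporting sources found")
    excerpts = "\n".join(f"[{i + 1}] {d.text}" for i, d in enumerate(docs))
    reply = llm_complete(GROUNDED_PROMPT.format(excerpts=excerpts,
                                                question=question))
    if "CANNOT_ANSWER" in reply:
        return escalate_ticket(question, reason="not supported by approved sources")
    return reply
```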
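For defending against unsupported citations, one cheap post-generation check: every cited excerpt number must actually exist, and every sentence must carry at least one citation before the reply is shown. The regex-based sentence split is a deliberate simplification:

```python
import re

# Reject replies that cite nonexistent excerpts or contain uncited sentences.
# Sentence splitting via regex is a simplification; a real system would also
# verify that each cited excerpt semantically supports its sentence.
def citations_supported(reply: str, num_excerpts: int) -> bool:
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", reply)}
    if not cited or any(c < 1 or c > num_excerpts for c in cited):
        return False  # no citations at all, or a fabricated one
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reply) if s.strip()]
    return all(re.search(r"\[\d+\]", s) for s in sentences)
```

A reply that fails this check would be routed to the refusal/escalation path rather than shown to the user.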
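For post-launch monitoring, a sketch of a daily regression check over sampled production traffic. `sample_traffic`, `judge_hallucination` (e.g., an LLM-as-judge or a human review queue), and `page_oncall` are stand-ins; the 1.5% ceiling is the stated constraint, and the early-warning margin is an assumption:

```python
# Daily hallucination-rate monitor over a production sample.
def daily_regression_check(sample_traffic, judge_hallucination, page_oncall,
                           sample_size: int = 500) -> float:
    sample = sample_traffic(n=sample_size)
    rate = sum(judge_hallucination(turn) for turn in sample) / sample_size
    if rate >= 0.015:    # hard constraint breached
        page_oncall(f"hallucination rate {rate:.2%} over ceiling", severity="page")
    elif rate >= 0.012:  # assumed early-warning margin
        page_oncall(f"hallucination rate {rate:.2%} nearing ceiling", severity="warn")
    return rate
```

The same harness can gate iteration on prompts, retrieval, or models: run the candidate in a shadow deployment, score it on the same sampled traffic, and ship only if its rate is no worse than the incumbent's.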