Context
FinGlide, a personal finance app, has an LLM-powered support assistant that answers user questions about account features, fees, transaction disputes, and card policies. After launch, the team found that some answers are plausible but incorrect, especially when policy details changed recently or when the model answers beyond the provided knowledge base.
Constraints
- p95 latency must stay under 2,000ms
- Cost ceiling: $12K/month at 300K requests/month
- Hallucination rate must be reduced to <2% on a production-like golden set
- The assistant must refuse or escalate when evidence is insufficient
- Responses must not leak PII or follow malicious instructions embedded in retrieved content
- Existing UX expects a direct answer plus 1-3 cited sources
Available Resources
- 25K internal help-center articles, policy docs, and change logs
- 3 months of production logs with user query, retrieved docs, model answer, user feedback, and escalation outcome
- A small labeled set of 800 historical responses marked as correct / incorrect / unsupported
- Access to OpenAI models, embeddings, and a managed vector store
- Ability to change prompts, retrieval, routing, guardrails, and fallback behavior, but not the frontend experience
Task
- Propose an eval-first plan to diagnose why hallucinations are happening in production, including how you would separate retrieval failures, prompt failures, and model failures.
- Design a production mitigation strategy that improves factuality without breaking the latency and cost budgets. Be explicit about prompt changes, retrieval changes, refusal behavior, and when to escalate.
- Define offline and online evaluation, including golden-set design, adversarial testing for prompt injection, and rollout metrics.
- Provide a sample system prompt and a Python implementation for the answer pipeline, including structured output and basic citation validation.
- Identify key failure modes, safety risks, and the main cost/latency tradeoffs in your design.