Context
FinMate, a personal finance app, uses an LLM assistant to answer user questions about transactions, budgets, and help-center policies. The current system depends on a single external model provider, and response quality drops sharply when that provider is slow or unavailable.
Constraints
- p95 end-to-end latency: < 1,500ms during normal operation, < 2,000ms during provider degradation
- Availability SLO: 99.9% successful responses
- Cost ceiling: $12K/month at 1.2M requests/month
- Hallucination rate on a 400-question golden set: < 2% for policy/account answers
- For unsupported or degraded scenarios, the system should prefer safe partial answers or explicit fallback/refusal over guessing
- Must resist prompt injection from retrieved help-center content or user input
- No sensitive account data may be logged in raw form
Available Resources
- 25K help-center and policy documents, updated daily
- Structured product metadata and a small library of deterministic backend APIs (account status, transaction lookup, card limits)
- Access to two LLM providers: one high-quality but slower model, and one cheaper/faster model
- Redis or equivalent cache, feature flags, and standard observability tooling
- Historical logs with latency, provider errors, and user feedback
Task
- Design a response architecture that keeps answers fast when the primary provider is slow, rate-limited, or unavailable. Include caching, model fallback, and any degraded-mode behavior.
- Specify the prompting strategy for grounded answers, safe refusals, and handling prompt injection in retrieved content.
- Define an evaluation-first plan: offline tests for latency/quality/safety and online metrics or experiments to validate the design.
- Estimate cost and latency for normal mode and degraded mode, and explain how routing decisions balance both.
- Identify the main failure modes, including hallucinations during fallback and stale/unsafe cached responses, and propose mitigations.