Context
Intuit wants to add a generative AI assistant inside TurboTax and QuickBooks to answer user questions about tax guidance, bookkeeping workflows, and account-specific product help. The feature must operate in a regulated financial environment where incorrect or non-compliant answers can create legal, trust, and customer-support risk.
Constraints
- p95 latency: ≤2,500 ms for a grounded answer
- Cost ceiling: $0.03 per request ($150K/month at 5M requests/month)
- Hallucination ceiling: <1% on a high-risk golden set for tax/compliance questions
- Must cite approved sources for factual claims
- Must refuse or escalate when the answer depends on missing user context, regulated advice boundaries, or unsupported claims
- Must defend against prompt injection, PII leakage, and unauthorized retrieval across customer accounts
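The cost and latency constraints above can be sanity-checked numerically. A minimal sketch follows; the stage-level split of the 2,500 ms p95 budget is an illustrative assumption, not part of the brief.

```python
# Sanity check of the stated budgets. The per-stage latency split is a
# hypothetical allocation, not a requirement from the brief.

COST_PER_REQUEST = 0.03          # $ ceiling per request (from the brief)
REQUESTS_PER_MONTH = 5_000_000   # from the brief

monthly_cost = COST_PER_REQUEST * REQUESTS_PER_MONTH
assert monthly_cost == 150_000   # consistent with the $150K/month ceiling

# Assumed split of the 2,500 ms p95 budget across pipeline stages.
P95_BUDGET_MS = 2_500
latency_budget_ms = {
    "auth_and_routing": 100,
    "retrieval_and_rerank": 400,
    "llm_generation": 1_800,
    "guardrails_and_citation_check": 200,
}
assert sum(latency_budget_ms.values()) == P95_BUDGET_MS
```

Keeping an explicit per-stage budget like this makes it obvious which stage must shrink when a new guardrail or reranking pass is added.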
Available Resources
- Approved corpora: TurboTax help center, IRS publications, Intuit policy docs, QuickBooks support articles, internal compliance-approved response templates
- Structured metadata: doc version, jurisdiction, tax year, product surface, approval status, sensitivity label
- Models: one high-quality LLM, one lower-cost LLM, embeddings model, reranker
- Existing identity and authorization layer for QuickBooks/TurboTax users
- Compliance team can label 1,000 high-risk prompts and review monthly regressions
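The structured metadata listed above can back a hard pre-retrieval filter so only compliance-approved content is ever eligible for grounding. A sketch, assuming field names and a filter policy that are illustrative rather than an actual Intuit schema:

```python
from dataclasses import dataclass

# Hypothetical shape of the structured metadata listed above; field names
# and allowed values are assumptions for illustration.

@dataclass(frozen=True)
class DocMeta:
    doc_version: str
    jurisdiction: str        # e.g. "US-federal", "US-CA"
    tax_year: int
    product_surface: str     # e.g. "turbotax", "quickbooks"
    approval_status: str     # e.g. "approved", "draft"
    sensitivity_label: str   # e.g. "public", "internal"

def retrievable(meta: DocMeta, jurisdiction: str, tax_year: int,
                product: str) -> bool:
    """Hard pre-retrieval filter: only compliance-approved, public docs
    matching the user's jurisdiction, tax year, and product surface."""
    return (meta.approval_status == "approved"
            and meta.sensitivity_label == "public"
            and meta.jurisdiction == jurisdiction
            and meta.tax_year == tax_year
            and meta.product_surface == product)
```

Filtering on metadata before vector search (rather than after generation) is what keeps stale or unapproved guidance out of the context window entirely.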
Task
- Design an eval-first LLM system for this assistant, including how you would measure factuality, refusal quality, prompt-injection robustness, and compliance before launch.
- Propose the RAG architecture and prompt design needed to keep answers grounded in approved financial content while meeting latency and cost constraints.
- Explain how you would mitigate the main deployment risks in a regulated environment: hallucinations, stale guidance, prompt injection, PII exposure, cross-tenant data leakage, and overconfident advice.
- Define the online monitoring and rollout plan, including guardrails, escalation paths to human support or CPA/tax expert workflows, and rollback criteria.
- Estimate cost/latency tradeoffs and identify where you would use smaller vs. larger models.
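One way to frame the last task: route high-risk tax/compliance queries to the high-quality LLM and routine product-help queries to the lower-cost LLM, then check the blended cost against the $0.03 ceiling. The sketch below is a toy illustration; the keyword list, model names, and per-model costs are placeholder assumptions (a production router would use a trained risk classifier, not keywords).

```python
# Hypothetical risk-based router. Keywords and cost figures are placeholder
# assumptions for illustrating the small-vs-large model tradeoff.

HIGH_RISK_TERMS = {"deduction", "irs", "audit", "filing", "penalty", "1099"}

def route_model(query: str) -> str:
    """Send queries touching tax/compliance topics to the stronger model."""
    tokens = set(query.lower().split())
    if tokens & HIGH_RISK_TERMS:
        return "high_quality_llm"   # slower, costlier, lower hallucination rate
    return "low_cost_llm"           # cheaper path for routine product help

def blended_cost(high_risk_share: float,
                 high_cost: float = 0.025, low_cost: float = 0.004) -> float:
    """Expected per-request cost given the share routed to the big model."""
    return high_risk_share * high_cost + (1 - high_risk_share) * low_cost
```

Under these assumed per-model costs, even routing 100% of traffic to the high-quality model stays under the $0.03 ceiling, so the router's real job is margin for retrieval/guardrail spend and latency headroom, not just raw cost.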