Context
Intuit is preparing to launch a generative AI assistant inside TurboTax that answers user questions about the filing flow, tax forms, and product guidance during return preparation. Your job is to decide whether the feature is safe and useful enough to ship, not just whether the model sounds good in demos.
Constraints
- p95 latency: < 2,500 ms per response
- Cost ceiling: < $0.03 per response and < $250K/month at projected peak-season volume
- Hallucination ceiling: < 1.5% on a high-risk evaluation set
- Unsafe advice rate (fabricated tax guidance, unsupported legal/tax claims): < 0.5%
- Prompt injection success rate: ~0% on the adversarial (red-team) test set
- Must avoid exposing PII or returning advice outside approved TurboTax content and policy boundaries
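One way to make the numeric ceilings above checkable is to encode them as a single gate object and compare measured values against it. The following is a minimal sketch, not a prescribed implementation: the names `ShipGates`, `failed_gates`, and `max_monthly_responses`, and the metric keys, are hypothetical. The prompt-injection target is left out because the brief states it only as ~0% rather than as a numeric ceiling. The sketch also works out the response volume implied by the two cost limits.

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ShipGates:
    """Numeric ceilings taken from the constraints list (field names are illustrative)."""
    p95_latency_ms: float = 2500.0
    cost_per_response_usd: float = 0.03
    monthly_cost_usd: float = 250_000.0
    hallucination_rate: float = 0.015
    unsafe_advice_rate: float = 0.005


def failed_gates(measured: dict, gates: ShipGates = ShipGates()) -> list:
    """Return the names of gates whose measured value is not strictly below its ceiling."""
    limits = asdict(gates)
    return [name for name, limit in limits.items() if measured[name] >= limit]


def max_monthly_responses(gates: ShipGates = ShipGates()) -> int:
    """Largest monthly volume that fits the budget if every response costs the per-response ceiling."""
    return math.floor(gates.monthly_cost_usd / gates.cost_per_response_usd)


# Hypothetical measurements, for illustration only.
measured = {
    "p95_latency_ms": 2100.0,
    "cost_per_response_usd": 0.021,
    "monthly_cost_usd": 180_000.0,
    "hallucination_rate": 0.018,
    "unsafe_advice_rate": 0.003,
}
print(failed_gates(measured))
print(max_monthly_responses())
```

At exactly $0.03 per response, the $250K monthly cap is reached at roughly 8.3 million responses, so a higher projected peak volume would require a lower average cost per response, for example by routing low-risk traffic to the smaller fallback model.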
Available Data / Models
- Historical anonymized TurboTax help-center articles, product FAQs, and approved tax guidance snippets
- 20K labeled support conversations with resolution outcomes
- A red-team set of adversarial prompts, jailbreaks, and prompt-injection attempts
- Human reviewers from tax, legal, and customer support operations
- Access to an approved LLM provider and a smaller fallback model for low-risk traffic
Task
- Define a ship-readiness evaluation framework for this TurboTax AI feature, including what “safe” and “useful” mean in measurable terms.
- Propose an offline evaluation plan before launch: golden sets, adversarial tests, human review, LLM-as-judge calibration, and programmatic checks.
- Propose an online evaluation plan after launch: experiment design, success metrics, guardrails, escalation criteria, and rollback triggers.
- Design the minimum production architecture needed to support the evaluation goals, including prompt design, grounding strategy, and safety controls.
- Estimate cost and latency, and explain what tradeoffs you would make if the system misses either the safety bar or the budget.
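For the LLM-as-judge calibration step in the offline plan, one common check is chance-corrected agreement between the judge's labels and human reviewer labels on the same responses. The sketch below assumes binary labels (1 = hallucinated, 0 = grounded). `cohens_kappa` is an illustrative helper, and Cohen's kappa is one of several reasonable agreement statistics rather than something the brief mandates.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa between two raters' labels over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten responses, for illustration only.
human_labels = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
print(round(cohens_kappa(human_labels, judge_labels), 3))
```

Raw agreement here is 80%, but kappa is noticeably lower because some of that agreement is expected by chance. A judge would typically be trusted for automated scoring only once its agreement with tax and legal reviewers reaches a threshold the team fixes in advance.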
Be explicit about how you would measure hallucination, refusal quality, prompt-injection resistance, and whether the assistant actually helps users complete filing tasks faster or with fewer support contacts.
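When measuring hallucination against a ceiling as low as 1.5%, the size of the evaluation set matters as much as the observed rate. A minimal sketch, assuming each response in the high-risk set is labeled hallucinated or not, and assuming the gate is applied to the upper end of a 95% Wilson score interval rather than to the raw rate (a design choice, not a requirement stated in the brief); `wilson_upper_bound` and `clears_ceiling` are hypothetical helper names.

```python
import math


def wilson_upper_bound(failures: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a failure rate (z=1.96 gives a 95% interval)."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p_hat = failures / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return min(1.0, center + half_width)


def clears_ceiling(failures: int, n: int, ceiling: float = 0.015) -> bool:
    """True only if the upper confidence bound, not just the point estimate, is below the ceiling."""
    return wilson_upper_bound(failures, n) < ceiling


# Zero observed hallucinations is not enough evidence on a small set.
print(clears_ceiling(0, 200))
print(clears_ceiling(0, 300))
print(clears_ceiling(3, 1000))
```

Under these assumptions, zero hallucinations in 200 samples still leaves an upper bound of about 1.9%, above the 1.5% ceiling, so the high-risk set needs to be large enough for the gate to be meaningful. The same check applies to the 0.5% unsafe-advice ceiling by passing `ceiling=0.005`.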