Evaluate Intuit AI Tax Assistant

Context

Intuit is preparing to launch a generative AI assistant inside TurboTax that answers user questions about filing flow, tax forms, and product guidance during return preparation. Your job is to decide whether the feature is safe and useful enough to ship, not just whether the model sounds good in demos.

Constraints

p95 latency: < 2,500 ms per response
Cost ceiling: < $0.03 per response and < $250K/month at projected peak season volume
Hallucination ceiling: < 1.5% on a high-risk evaluation set
Unsafe advice rate (fabricated tax guidance, unsupported legal/tax claims): < 0.5%
Prompt injection success rate: ~0% on adversarial tests
Must avoid exposing PII or returning advice outside approved TurboTax content and policy boundaries
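
The numeric constraints above can be written down as a single pass/fail gate, which also makes the relationship between the two cost limits explicit. The following is a minimal sketch, not a prescribed implementation: the metric names, the `failed_gates` and `max_monthly_responses` helpers, and the example measurements are all hypothetical. Prompt injection is left out because the constraint is stated as "~0%" rather than as a number.

```python
# Thresholds copied from the Constraints list. Key names are hypothetical.
# Prompt injection is omitted: the source gives "~0%" rather than a number.
LIMITS = {
    "p95_latency_ms": 2500.0,
    "cost_per_response_usd": 0.03,
    "monthly_cost_usd": 250_000.0,
    "hallucination_rate": 0.015,
    "unsafe_advice_rate": 0.005,
}


def failed_gates(measured):
    """Return names of constraints that are missing or not strictly under their limit."""
    return [
        name
        for name, limit in LIMITS.items()
        if not measured.get(name, float("inf")) < limit
    ]


def max_monthly_responses(per_response_usd=0.03, monthly_budget_usd=250_000.0):
    """Number of responses per month the budget covers at a given per-response cost."""
    return int(monthly_budget_usd / per_response_usd)


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    example = {
        "p95_latency_ms": 2300.0,
        "cost_per_response_usd": 0.021,
        "monthly_cost_usd": 240_000.0,
        "hallucination_rate": 0.012,
        "unsafe_advice_rate": 0.006,
    }
    print(failed_gates(example))
    print(max_monthly_responses())
```

At the full $0.03 per response, the monthly ceiling covers roughly 8.3 million responses, so any projected peak volume above that forces the per-response cost below $0.03.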

Available Data / Models

Historical anonymized TurboTax help-center articles, product FAQs, and approved tax guidance snippets
20K labeled support conversations with resolution outcomes
A red-team set of adversarial prompts, jailbreaks, and prompt-injection attempts
Human reviewers from tax, legal, and customer support operations
Access to an approved LLM provider and a smaller fallback model for low-risk traffic

Task

Define a ship-readiness evaluation framework for this TurboTax AI feature, including what “safe” and “useful” mean in measurable terms.
Propose an offline evaluation plan before launch: golden sets, adversarial tests, human review, LLM-as-judge calibration, and programmatic checks.
Propose an online evaluation plan after launch: experiment design, success metrics, guardrails, escalation criteria, and rollback triggers.
Design the minimum production architecture needed to support the evaluation goals, including prompt design, grounding strategy, and safety controls.
Estimate cost and latency, and explain what tradeoffs you would make if the system misses either the safety bar or the budget.

Be explicit about how you would measure hallucination, refusal quality, prompt-injection resistance, and whether the assistant actually helps users complete filing tasks faster or with fewer support contacts.
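
One way to make a rate measurement explicit is to report an upper confidence bound alongside the observed rate, since a finite evaluation set can look safe by chance. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the task prescribes. The function names, the 95% z-value, and the example counts are assumptions.

```python
import math


def wilson_upper(failures, n, z=1.96):
    """Upper end of the Wilson score interval for a failure rate (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center + half


def passes_ceiling(failures, n, ceiling):
    """True only if the upper bound, not just the observed rate, is under the ceiling."""
    return wilson_upper(failures, n) < ceiling


if __name__ == "__main__":
    # Hypothetical counts: 10 hallucinations among 1,000 reviewed high-risk answers.
    print(wilson_upper(10, 1000))
    print(passes_ceiling(10, 1000, 0.015))
    # Hypothetical counts: 0 successful injections among 1,000 adversarial prompts.
    print(wilson_upper(0, 1000))
```

With these hypothetical counts, an observed 1.0% hallucination rate on 1,000 items has an upper bound of roughly 1.8%, so it would not clear the 1.5% ceiling under this stricter reading. Zero successes on 1,000 adversarial prompts still leaves an upper bound near 0.4%, which is one way to put a number on what "~0%" can mean for a given red-team set size.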
