Context
FinPilot is building an AI assistant that answers customer questions about credit-card benefits, fees, disputes, and rewards using internal policy documents and public help-center content. Leadership wants a clear ship/no-ship decision for a beta launch to 5% of customers.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $0.03 per answered request (≈$45K/month at 50K requests/day)
- Hallucination ceiling: <2% on a labeled golden set for policy-related questions
- Unsafe answer rate: <0.5% for regulated or high-risk financial guidance
- Prompt-injection success rate: effectively 0% (no reproducible successes) on adversarial tests
- The feature must refuse or escalate when evidence is missing, conflicting, or out of policy
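The per-request and monthly cost ceilings above are mutually consistent; a quick sketch of the arithmetic (assuming a 30-day month):

```python
# Sanity-check of the cost ceiling stated in the constraints.
COST_PER_REQUEST = 0.03   # dollars per answered request
REQUESTS_PER_DAY = 50_000
DAYS_PER_MONTH = 30       # assumption: flat 30-day month

monthly_spend = COST_PER_REQUEST * REQUESTS_PER_DAY * DAYS_PER_MONTH
print(f"${monthly_spend:,.0f}/month")  # $45,000/month, matching the stated ceiling
```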
Available Resources
- 120K internal documents: policy manuals, dispute SOPs, compliance FAQs, release notes
- 15K public help-center articles and product pages
- Query logs from a legacy search experience, including follow-up contacts and CSAT scores
- 2 approved model tiers: a cheaper fast model and a higher-quality model
- Existing hybrid retrieval stack (BM25 + vector search), document metadata, and user-permission filters
- 40 operations specialists available to label a 1,000-question evaluation set
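It is worth noting what a 1,000-question golden set can actually resolve against the 2% hallucination ceiling. A rough sizing sketch using the normal approximation to the binomial (the numbers are illustrative assumptions, not results):

```python
import math

n = 1_000   # labeled golden-set size from the resources above
p = 0.02    # hallucination ceiling from the constraints
z = 1.96    # two-sided 95% confidence

# Normal-approximation margin of error for an observed rate near p.
margin = z * math.sqrt(p * (1 - p) / n)
print(f"95% margin at p=2%, n=1,000: ±{margin:.2%}")  # roughly ±0.9 percentage points
```

In other words, a measured rate near the 2% bar carries close to a full percentage point of uncertainty at this sample size, which matters when the ceiling itself is 2%.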
Task
- Define what “ready to ship” means for this AI feature, including offline thresholds, online guardrails, and explicit launch blockers.
- Design an evaluation plan that measures answer quality, hallucination, refusal quality, retrieval quality, safety, and prompt-injection robustness before launch.
- Propose the minimum architecture and prompting approach needed to meet the bar, including when the system should answer, refuse, or escalate to a human.
- Estimate expected cost and latency at target volume, and explain what tradeoffs you would make if quality misses target or spend exceeds budget.
- Identify the top failure modes you would monitor after launch and the rollback criteria for pausing or disabling the feature.
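One way to make the answer/refuse/escalate behavior asked for above concrete is a small decision gate over retrieval signals. Everything below (the field names, the 0.55 evidence threshold) is a hypothetical sketch under assumed signals, not a prescribed design; real thresholds would be tuned on the golden set:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    REFUSE = "refuse"
    ESCALATE = "escalate"

@dataclass
class RetrievalResult:
    top_score: float    # best retrieval relevance score (hypothetical signal)
    conflicting: bool   # retrieved documents disagree with each other
    in_policy: bool     # topic is within answerable policy scope
    high_risk: bool     # regulated / high-risk financial guidance

MIN_EVIDENCE_SCORE = 0.55  # hypothetical threshold, tuned offline

def decide(r: RetrievalResult) -> Action:
    if not r.in_policy:
        return Action.REFUSE     # out of policy: refuse outright
    if r.high_risk or r.conflicting:
        return Action.ESCALATE   # regulated or conflicting evidence: hand to a human
    if r.top_score < MIN_EVIDENCE_SCORE:
        return Action.REFUSE     # missing evidence: refuse rather than guess
    return Action.ANSWER
```

The gate mirrors the constraint that the feature must refuse or escalate when evidence is missing, conflicting, or out of policy, with escalation reserved for cases where a human can add value.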