Context
FinPilot is building an LLM-powered assistant that answers customer questions about credit-card benefits, fees, disputes, and account policies inside its mobile app. The team has a working prototype, but leadership will only ship if the feature is demonstrably useful and safe.
Constraints
- p95 end-to-end latency: at most 2,500 ms
- Cost ceiling: $0.03 per request and $45K/month at 1.5M requests/month
- Usefulness bar: at least 75% of evaluated answers are rated helpful on a representative test set
- Safety bar: hallucination rate below 2%, prompt-injection success rate below 0.5%, and unsafe financial guidance rate below 0.1%
- The assistant must refuse when policy information is missing, ambiguous, or unsupported
- Responses must not leak PII or internal-only policy text
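Because the safety bars are low rates (2%, 0.5%, 0.1%) measured on a finite evaluation set, a pass/fail gate should account for sampling error rather than compare point estimates directly. A minimal sketch, assuming the 1K labeled set described under Available Resources; `wilson_upper` and `passes_bar` are illustrative helper names, not part of any prescribed tooling:

```python
import math


def wilson_upper(failures: int, n: int, z: float = 1.96) -> float:
    """Upper bound of the 95% Wilson score interval for a failure rate."""
    if n == 0:
        return 1.0
    p = failures / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / denom


def passes_bar(failures: int, n: int, threshold: float) -> bool:
    """Pass only if the upper confidence bound is below the threshold,
    so a lucky sample cannot clear a bar the true rate would miss."""
    return wilson_upper(failures, n) < threshold


# With 1,000 examples, zero observed hallucinations gives an upper
# bound near 0.4%, comfortably under the 2% bar; 15 observed failures
# (a 1.5% point estimate) would not clear it.
```

One implication worth noting: at n = 1,000 the 0.1% unsafe-guidance bar cannot be demonstrated offline even with zero observed failures, which argues for a larger adversarial slice or an online guardrail for that metric.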
Available Resources
- 40K internal policy documents, help-center articles, and compliance FAQs
- 8K historical customer-support chats with human resolutions
- 1K manually labeled evaluation examples across common, ambiguous, and adversarial queries
- Access to an approved GPT-4-class model and a cheaper small model for routing or grading
- Product analytics events: follow-up question rate, escalation to human support, thumbs up/down
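Since the 1K labeled examples span common, ambiguous, and adversarial queries, offline runs should sample each stratum explicitly so rare adversarial cases are not drowned out by common ones. A minimal sketch, assuming each example carries a `stratum` field; the field name, counts, and `stratified_sample` helper are illustrative:

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_stratum, seed=0):
    """Draw a fixed number of examples from each stratum so every
    query category is represented in an evaluation batch."""
    rng = random.Random(seed)  # fixed seed keeps eval runs reproducible
    by_stratum = defaultdict(list)
    for ex in examples:
        by_stratum[ex["stratum"]].append(ex)
    sample = []
    for _, items in sorted(by_stratum.items()):
        k = min(per_stratum, len(items))
        sample.extend(rng.sample(items, k))
    return sample


# Illustrative pool: 1,000 labeled examples, skewed toward "common".
pool = (
    [{"id": i, "stratum": "common"} for i in range(700)]
    + [{"id": i, "stratum": "ambiguous"} for i in range(700, 900)]
    + [{"id": i, "stratum": "adversarial"} for i in range(900, 1000)]
)
batch = stratified_sample(pool, per_stratum=100)  # 300 examples, 100 per stratum
```

Reporting metrics per stratum, not just in aggregate, also surfaces failures that a blended helpfulness number would hide.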
Task
- Define a ship-readiness evaluation framework that determines whether the feature is safe and useful enough to launch, including explicit offline metrics, thresholds, and pass/fail criteria.
- Design the online evaluation plan for a limited rollout, including success metrics, guardrails, segmentation, and rollback triggers.
- Propose the minimum architecture and prompting approach needed to meet the evaluation bar, including how the system should refuse, cite evidence, and defend against prompt injection.
- Explain how you would calibrate an LLM-as-judge grader, where human review is required, and how you would monitor regressions after launch.
- Estimate cost and latency for the evaluation and serving approach, and describe the main tradeoffs you would make if the system misses either budget.
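For the judge-calibration task above, one common check is chance-corrected agreement between judge labels and human labels on a shared slice. A minimal sketch, assuming binary helpful/not-helpful labels; the 0.7 gate is an illustrative choice, not a prescribed value:

```python
def cohens_kappa(human, judge):
    """Cohen's kappa: agreement between human and LLM-judge labels,
    corrected for the agreement expected by chance alone.
    Labels are binary: 1 = helpful, 0 = not helpful."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h = sum(human) / n  # human positive rate
    p_j = sum(judge) / n  # judge positive rate
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Gate: trust the judge for offline scoring only if kappa on a
# human-labeled calibration slice clears an agreed bar (e.g. 0.7);
# otherwise route more examples to human review.
```

Raw percent agreement alone would overstate judge quality whenever one label dominates, which is exactly the situation for low-rate safety metrics; kappa avoids that failure mode.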