Context
FinSure, a B2B insurance software company, wants to launch an internal AI assistant that answers employee questions using policy manuals, compliance guides, and support runbooks. Leadership is not asking you to build the assistant from scratch; they want to know how you would determine whether the capability is actually ready for enterprise use.
Constraints
- p95 latency must stay under 2,500 ms
- Cost ceiling: $0.03 per request and $25K/month at 1M requests/month (note the monthly cap is the binding constraint, since 1M requests at $0.03 would cost $30K; it implies an average of $0.025 per request)
- Hallucination rate must be below 2% on a representative evaluation set
- Prompt injection success rate must be below 0.5%
- The system must refuse unsupported or policy-violating requests rather than guess
- All factual answers must be grounded in approved internal documents
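The thresholds above are hard launch gates, not targets to trend toward. A minimal sketch of how they could be encoded as pass/fail checks in an eval harness (metric names, structure, and the sample values are illustrative assumptions, not part of the brief):

```python
# Hypothetical launch-gate check: every gate is a hard threshold from the
# constraints list; any single failure blocks launch.
GATES = {
    "p95_latency_ms":         2500,    # p95 latency must stay under 2,500 ms
    "cost_per_request_usd":   0.03,    # per-request cost ceiling
    "hallucination_rate":     0.02,    # below 2% on the eval set
    "injection_success_rate": 0.005,   # below 0.5% on adversarial cases
}

def evaluate_gates(metrics: dict) -> list[str]:
    """Return the list of failed gates; an empty list means launch-ready."""
    failures = []
    for name, limit in GATES.items():
        value = metrics[name]
        if value >= limit:  # all gates are strict "must stay under" limits
            failures.append(f"{name}={value} violates < {limit}")
    return failures

# Example with hypothetical measured metrics: one gate fails.
failures = evaluate_gates({
    "p95_latency_ms": 2100,
    "cost_per_request_usd": 0.021,
    "hallucination_rate": 0.031,      # exceeds the 2% gate
    "injection_success_rate": 0.002,
})
```

Keeping the gates in data rather than scattered `if` statements makes it easy to report every violated threshold at once instead of failing on the first.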
Available Resources
- 200K internal documents (PDFs, markdown, policy pages, runbooks)
- Existing RAG pipeline with hybrid retrieval and source citations
- Access to one frontier model and one cheaper fallback model
- 5,000 historical employee questions, including escalations and user feedback
- Compliance team can label 300 high-risk examples for a golden set
- Security team can provide adversarial prompt-injection test cases
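The 300 compliance-labeled examples are what make an LLM-as-judge approach calibratable: judge verdicts can be scored against human labels before the judge is trusted on the full eval set. A minimal sketch (the label data here is made up for illustration), using Cohen's kappa to correct raw agreement for chance:

```python
# Hypothetical judge calibration: compare the LLM judge's binary
# hallucination verdicts (1 = hallucinated, 0 = grounded) against the
# compliance team's golden labels. Low kappa means the judge cannot be
# trusted to measure the 2% hallucination gate.

def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement between two binary label lists."""
    n = len(human)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h1, p_j1 = sum(human) / n, sum(judge) / n
    p_chance = p_h1 * p_j1 + (1 - p_h1) * (1 - p_j1)
    if p_chance == 1:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up labels standing in for the 300-example golden set:
human_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
judge_labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
kappa = cohens_kappa(human_labels, judge_labels)
```

In practice the calibration report should also break out false negatives (hallucinations the judge misses), since those directly understate the gated metric.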
Task
- Design an evaluation-first enterprise readiness framework for this AI capability, including clear launch gates and non-negotiable failure thresholds.
- Define the offline evaluation plan: datasets, rubrics, segmentation, hallucination measurement, prompt-injection testing, and how you would calibrate any LLM-as-judge approach.
- Define the online evaluation plan: success metrics, guardrails, rollout stages, and rollback criteria after launch.
- Propose the minimum architecture and prompting changes needed to meet the evaluation bar, including grounded answering and refusal behavior.
- Estimate the cost/latency implications of your plan and explain what tradeoffs you would make if quality, safety, and budget conflict.
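The cost estimate in the last bullet is largely a routing question: what fraction of traffic can go to the frontier model before the $25K/month ceiling is breached? A back-of-envelope sketch, where both per-request costs and the routing splits are hypothetical assumptions for illustration, not vendor prices:

```python
# Hypothetical cost model: a two-tier router sends some fraction of the
# 1M monthly requests to the frontier model and the rest to the fallback.
REQUESTS_PER_MONTH = 1_000_000
FRONTIER_COST_USD = 0.040   # assumed per-request cost, frontier model
FALLBACK_COST_USD = 0.008   # assumed per-request cost, cheaper fallback
BUDGET_USD = 25_000

def monthly_cost(frontier_fraction: float) -> float:
    """Blended monthly spend for a given frontier-routing fraction."""
    per_request = (frontier_fraction * FRONTIER_COST_USD
                   + (1 - frontier_fraction) * FALLBACK_COST_USD)
    return per_request * REQUESTS_PER_MONTH

# Under these assumed prices, routing everything to the frontier model
# costs $40K/month and blows the ceiling; a 50/50 split costs $24K/month
# and fits, so the quality/cost tradeoff lives in that routing fraction.
all_frontier = monthly_cost(1.0)
half_split = monthly_cost(0.5)
```

This is the kind of arithmetic the tradeoff discussion should make explicit: if the safety and quality gates demand more frontier traffic than the budget allows, something (the ceiling, the routing policy, or the latency/quality bar) has to move.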