Context
FinSure, a B2B insurance software company, wants to launch an internal AI assistant that answers employee questions using policy manuals, compliance guides, and support runbooks. Leadership is not asking you to build the assistant from scratch; they want to know how you would determine whether the capability is actually ready for enterprise use.
Constraints
- p95 latency must stay under 2,500 ms
- Cost ceiling: $0.03 per request and $25K/month at 1M requests/month (note the monthly cap is the binding constraint, since 1M requests at $0.03 would cost $30K; it implies an average of $0.025 per request)
- Hallucination rate must be below 2% on a representative evaluation set
- Prompt injection success rate must be below 0.5%
- The system must refuse unsupported or policy-violating requests rather than guess
- All factual answers must be grounded in approved internal documents
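The thresholds above are hard launch gates, not targets to trend toward. A minimal sketch of how they could be encoded as pass/fail checks in an eval harness (metric names, structure, and the sample values are illustrative assumptions, not part of the brief):

```python
# Hypothetical launch-gate check: every gate is a hard threshold from the
# constraints list; any single failure blocks launch.
GATES = {
    "p95_latency_ms":         2500,    # p95 latency must stay under 2,500 ms
    "cost_per_request_usd":   0.03,    # per-request cost ceiling
    "hallucination_rate":     0.02,    # below 2% on the eval set
    "injection_success_rate": 0.005,   # below 0.5% on adversarial cases
}

def evaluate_gates(metrics: dict) -> list[str]:
    """Return the list of failed gates; an empty list means launch-ready."""
    failures = []
    for name, limit in GATES.items():
        value = metrics[name]
        if value >= limit:  # all gates are strict "must stay under" limits
            failures.append(f"{name}={value} violates < {limit}")
    return failures

# Example with hypothetical measured metrics: one gate fails.
failures = evaluate_gates({
    "p95_latency_ms": 2100,
    "cost_per_request_usd": 0.021,
    "hallucination_rate": 0.031,      # exceeds the 2% gate
    "injection_success_rate": 0.002,
})
```

Keeping the gates in data rather than scattered `if` statements makes it easy to report every violated threshold at once instead of failing on the first.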
Available Resources
- 200K internal documents (PDFs, markdown, policy pages, runbooks)
- Existing RAG pipeline with hybrid retrieval and source citations
- Access to one frontier model and one cheaper fallback model
- 5,000 historical employee questions, including escalations and user feedback
- Compliance team can label 300 high-risk examples for a golden set
- Security team can provide adversarial prompt-injection test cases
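The 300 compliance-labeled examples are what make an LLM-as-judge approach calibratable: judge verdicts can be scored against human labels before the judge is trusted on the full eval set. A minimal sketch (the label data here is made up for illustration), using Cohen's kappa to correct raw agreement for chance:

```python
# Hypothetical judge calibration: compare the LLM judge's binary
# hallucination verdicts (1 = hallucinated, 0 = grounded) against the
# compliance team's golden labels. Low kappa means the judge cannot be
# trusted to measure the 2% hallucination gate.

def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement between two binary label lists."""
    n = len(human)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h1, p_j1 = sum(human) / n, sum(judge) / n
    p_chance = p_h1 * p_j1 + (1 - p_h1) * (1 - p_j1)
    if p_chance == 1:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up labels standing in for the 300-example golden set:
human_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
judge_labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
kappa = cohens_kappa(human_labels, judge_labels)
```

In practice the calibration report should also break out false negatives (hallucinations the judge misses), since those directly understate the gated metric.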
Task
- Design an evaluation-first enterprise readiness framework for this AI capability, including clear launch gates and non-negotiable failure thresholds.
- Define the offline evaluation plan: datasets, rubrics, segmentation, hallucination measurement, prompt-injection testing, and how you would calibrate any LLM-as-judge approach.
- Define the online evaluation plan: success metrics, guardrails, rollout stages, and rollback criteria after launch.
- Propose the minimum architecture and prompting changes needed to meet the evaluation bar, including grounded answering and refusal behavior.
- Estimate the cost/latency implications of your plan and explain what tradeoffs you would make if quality, safety, and budget conflict.
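The cost estimate in the last bullet is largely a routing question: what fraction of traffic can go to the frontier model before the $25K/month ceiling is breached? A back-of-envelope sketch, where both per-request costs and the routing splits are hypothetical assumptions for illustration, not vendor prices:

```python
# Hypothetical cost model: a two-tier router sends some fraction of the
# 1M monthly requests to the frontier model and the rest to the fallback.
REQUESTS_PER_MONTH = 1_000_000
FRONTIER_COST_USD = 0.040   # assumed per-request cost, frontier model
FALLBACK_COST_USD = 0.008   # assumed per-request cost, cheaper fallback
BUDGET_USD = 25_000

def monthly_cost(frontier_fraction: float) -> float:
    """Blended monthly spend for a given frontier-routing fraction."""
    per_request = (frontier_fraction * FRONTIER_COST_USD
                   + (1 - frontier_fraction) * FALLBACK_COST_USD)
    return per_request * REQUESTS_PER_MONTH

# Under these assumed prices, routing everything to the frontier model
# costs $40K/month and blows the ceiling; a 50/50 split costs $24K/month
# and fits, so the quality/cost tradeoff lives in that routing fraction.
all_frontier = monthly_cost(1.0)
half_split = monthly_cost(0.5)
```

This is the kind of arithmetic the tradeoff discussion should make explicit: if the safety and quality gates demand more frontier traffic than the budget allows, something (the ceiling, the routing policy, or the latency/quality bar) has to move.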