Context
OpenAI is piloting a Customer Success copilot that drafts answers for enterprise admins using help center articles, product docs, and prior resolved tickets. The team needs a rigorous way to determine whether the application is reliable enough to assist agents and, later, answer some low-risk questions directly.
Constraints
- p95 latency: at most 2,500 ms per response in the agent-assist workflow
- Cost ceiling: $12K/month at 300K requests/month (≈ $0.04 per request)
- Reliability bar: at least 92% rubric pass rate on a golden set for in-scope questions
- Hallucination ceiling: fewer than 2% unsupported factual claims
- Safety: prompt injection success rate below 0.5%, no leakage of hidden instructions, and no exposure of customer PII in logs
- The system must prefer refusal / escalation over guessing when evidence is weak
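As a rough illustration, the constraints above could be encoded as an automated release gate that blocks deployment when any threshold is violated. The `EvalReport` shape and field names here are assumptions for the sketch, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    rubric_pass_rate: float        # fraction of golden-set answers passing the rubric
    unsupported_claim_rate: float  # fraction of factual claims lacking evidence support
    injection_success_rate: float  # fraction of adversarial prompts that succeed
    p95_latency_ms: float          # p95 end-to-end response latency
    cost_per_request_usd: float    # blended serving cost per request

def release_gate(r: EvalReport) -> list[str]:
    """Return the list of violated constraints; an empty list means ship."""
    failures = []
    if r.rubric_pass_rate < 0.92:
        failures.append("rubric pass rate below 92%")
    if r.unsupported_claim_rate >= 0.02:
        failures.append("unsupported claim rate at or above 2%")
    if r.injection_success_rate >= 0.005:
        failures.append("prompt injection success rate at or above 0.5%")
    if r.p95_latency_ms > 2500:
        failures.append("p95 latency above 2,500 ms")
    if r.cost_per_request_usd > 12_000 / 300_000:  # $0.04/request ceiling
        failures.append("cost per request above $0.04")
    return failures
```

A gate like this makes the pass/fail thresholds executable, so every candidate model or prompt change is judged against the same bar.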
Available Resources
- 15,000 OpenAI help center and product documentation pages with metadata (surface, product, last updated)
- 80,000 historical support tickets with final agent resolution notes
- OpenAI models for generation, embeddings, and structured outputs
- 20 Customer Success specialists available to label a golden set and calibrate graders
- Existing telemetry: user feedback, agent edits, escalation events, and time-to-resolution
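The telemetry signals listed above can be rolled up into online reliability metrics per request. A minimal sketch, assuming a per-request audit record with these (hypothetical) field names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestLog:
    request_id: str
    intent: str                     # classified support intent, for segmentation
    retrieved_doc_ids: list[str] = field(default_factory=list)
    agent_edited: bool = False      # agent changed the draft before sending
    escalated: bool = False         # agent escalated instead of using the draft
    thumbs_up: Optional[bool] = None  # end-user feedback, if any

def online_metrics(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate logged events into simple online reliability rates."""
    n = len(logs)
    return {
        "edit_rate": sum(l.agent_edited for l in logs) / n,
        "escalation_rate": sum(l.escalated for l in logs) / n,
    }
```

Grouping the same aggregation by `intent` would give the per-segment view the evaluation framework needs.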
Task
- Design an evaluation-first framework for this LLM application: define what “reliable” means for this customer use case, including offline metrics, online metrics, pass/fail thresholds, and segmentation.
- Propose the testing methodology for factuality, instruction-following, refusal behavior, prompt injection resistance, and consistency across common support intents.
- Specify the minimal application architecture needed to support that evaluation, including prompt design, retrieval design (if used), and structured logging for audits.
- Estimate cost and latency for your proposed eval and serving setup, and explain how you would trade off model quality, response speed, and annotation cost.
- Identify likely failure modes, how you would detect them before launch and in production, and what rollback or escalation policy you would use if reliability regresses.
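For the cost estimate, a back-of-envelope serving-cost model is usually enough to check the trade-off against the $12K/month ceiling. The token counts and per-token prices below are placeholders, not actual OpenAI pricing; substitute real rates:

```python
def monthly_cost(requests_per_month: int,
                 input_tokens: int, output_tokens: int,
                 usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Estimated monthly serving cost from per-request token usage."""
    per_request = (input_tokens * usd_per_1m_input
                   + output_tokens * usd_per_1m_output) / 1_000_000
    return requests_per_month * per_request

# Example with placeholder rates: 300K requests/month, 3,000 input and
# 500 output tokens per request, at $1.00 / $4.00 per 1M tokens:
# per request = (3000*1.00 + 500*4.00) / 1e6 = $0.005, so $1,500/month.
```

The same model makes the quality/speed/cost trade-off concrete: a larger model or longer retrieved context raises `input_tokens` and per-token price, and the formula shows how much headroom remains under the ceiling.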