Context
OpenAI is piloting a Customer Success copilot that drafts answers for enterprise admins using help center articles, product docs, and prior resolved tickets. The team needs a rigorous way to determine whether the application is reliable enough to assist agents and, later, answer some low-risk questions directly.
Constraints
- p95 latency: at most 2,500 ms per response in the agent-assist workflow
- Cost ceiling: $12K/month at 300K requests/month (≈ $0.04 per request)
- Reliability bar: at least 92% rubric pass rate on a golden set for in-scope questions
- Hallucination ceiling: fewer than 2% unsupported factual claims
- Safety: prompt injection success rate below 0.5%, no leakage of hidden instructions, and no exposure of customer PII in logs
- The system must prefer refusal / escalation over guessing when evidence is weak
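As a rough illustration, the constraints above could be encoded as an automated release gate that blocks deployment when any threshold is violated. The `EvalReport` shape and field names here are assumptions for the sketch, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    rubric_pass_rate: float        # fraction of golden-set answers passing the rubric
    unsupported_claim_rate: float  # fraction of factual claims lacking evidence support
    injection_success_rate: float  # fraction of adversarial prompts that succeed
    p95_latency_ms: float          # p95 end-to-end response latency
    cost_per_request_usd: float    # blended serving cost per request

def release_gate(r: EvalReport) -> list[str]:
    """Return the list of violated constraints; an empty list means ship."""
    failures = []
    if r.rubric_pass_rate < 0.92:
        failures.append("rubric pass rate below 92%")
    if r.unsupported_claim_rate >= 0.02:
        failures.append("unsupported claim rate at or above 2%")
    if r.injection_success_rate >= 0.005:
        failures.append("prompt injection success rate at or above 0.5%")
    if r.p95_latency_ms > 2500:
        failures.append("p95 latency above 2,500 ms")
    if r.cost_per_request_usd > 12_000 / 300_000:  # $0.04/request ceiling
        failures.append("cost per request above $0.04")
    return failures
```

A gate like this makes the pass/fail thresholds executable, so every candidate model or prompt change is judged against the same bar.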
Available Resources
- 15,000 OpenAI help center and product documentation pages with metadata (surface, product, last updated)
- 80,000 historical support tickets with final agent resolution notes
- OpenAI models for generation, embeddings, and structured outputs
- 20 Customer Success specialists available to label a golden set and calibrate graders
- Existing telemetry: user feedback, agent edits, escalation events, and time-to-resolution
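The telemetry signals listed above can be rolled up into online reliability metrics per request. A minimal sketch, assuming a per-request audit record with these (hypothetical) field names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestLog:
    request_id: str
    intent: str                     # classified support intent, for segmentation
    retrieved_doc_ids: list[str] = field(default_factory=list)
    agent_edited: bool = False      # agent changed the draft before sending
    escalated: bool = False         # agent escalated instead of using the draft
    thumbs_up: Optional[bool] = None  # end-user feedback, if any

def online_metrics(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate logged events into simple online reliability rates."""
    n = len(logs)
    return {
        "edit_rate": sum(l.agent_edited for l in logs) / n,
        "escalation_rate": sum(l.escalated for l in logs) / n,
    }
```

Grouping the same aggregation by `intent` would give the per-segment view the evaluation framework needs.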
Task
- Design an evaluation-first framework for this LLM application: define what “reliable” means for this customer use case, including offline metrics, online metrics, pass/fail thresholds, and segmentation.
- Propose the testing methodology for factuality, instruction-following, refusal behavior, prompt injection resistance, and consistency across common support intents.
- Specify the minimal application architecture needed to support that evaluation, including prompt design, retrieval design (if used), and structured logging for audits.
- Estimate cost and latency for your proposed eval and serving setup, and explain how you would trade off model quality, response speed, and annotation cost.
- Identify likely failure modes, how you would detect them before launch and in production, and what rollback or escalation policy you would use if reliability regresses.
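For the cost estimate, a back-of-envelope serving-cost model is usually enough to check the trade-off against the $12K/month ceiling. The token counts and per-token prices below are placeholders, not actual OpenAI pricing; substitute real rates:

```python
def monthly_cost(requests_per_month: int,
                 input_tokens: int, output_tokens: int,
                 usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Estimated monthly serving cost from per-request token usage."""
    per_request = (input_tokens * usd_per_1m_input
                   + output_tokens * usd_per_1m_output) / 1_000_000
    return requests_per_month * per_request

# Example with placeholder rates: 300K requests/month, 3,000 input and
# 500 output tokens per request, at $1.00 / $4.00 per 1M tokens:
# per request = (3000*1.00 + 500*4.00) / 1e6 = $0.005, so $1,500/month.
```

The same model makes the quality/speed/cost trade-off concrete: a larger model or longer retrieved context raises `input_tokens` and per-token price, and the formula shows how much headroom remains under the ceiling.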