Evaluate AI Support Workflow Impact

Context

BrightDesk is a B2B customer-support platform testing an LLM-powered support workflow for small-business customers. The workflow drafts replies, retrieves help-center content, and suggests next actions to agents; leadership wants to know whether it actually helps customers succeed rather than just reducing handle time.

Constraints

p95 end-to-end latency: 2,500ms per assistant turn
Cost ceiling: $0.035 per assisted conversation turn, or $45K/month at 1.4M turns
Hallucination ceiling: <2% of responses on a labeled evaluation set may contain unsupported policy or product claims
Safety: must resist prompt injection from customer messages or retrieved docs, avoid leaking PII, and refuse when evidence is insufficient
Business goal: improve customer task success without increasing reopen rate, escalations, or compliance incidents

Available Resources

18 months of support conversations with outcomes: resolved/not resolved, reopen within 7 days, CSAT, refund issued, escalation, and retention at 30 days
Product help-center articles, internal support macros, policy docs, and troubleshooting guides
Event logs for agent actions and customer follow-up behavior
A baseline workflow without LLM assistance and a current pilot using retrieval + generation
Access to GPT-4.1-mini or Claude Sonnet-class models, embeddings, and a hybrid search index

Task

Define an evaluation framework that determines whether the LLM workflow improves customer success, not just agent efficiency. Specify primary metrics, guardrails, and how you would segment results by issue type and customer tier.
Design the offline evaluation suite first: golden set construction, hallucination and faithfulness checks, prompt-injection tests, and how you would calibrate any LLM-as-judge rubric against human labels.
Propose the online evaluation plan: experiment design, success criteria, unit of randomization, and how to handle confounders such as agent learning effects and issue-mix shifts.
Outline the workflow architecture and prompt strategy needed to support the evaluation goals, including grounded answering, citation requirements, and refusal behavior.
Estimate cost and latency, then explain what you would simplify or change if the system misses either budget while preserving customer-success gains.

Context

Constraints

p95 end-to-end latency: 2,500ms per assistant turn
Cost ceiling: $0.035 per assisted conversation turn, or $45K/month at 1.4M turns
Hallucination ceiling: <2% of responses on a labeled evaluation set may contain unsupported policy or product claims
Safety: must resist prompt injection from customer messages or retrieved docs, avoid leaking PII, and refuse when evidence is insufficient
Business goal: improve customer task success without increasing reopen rate, escalations, or compliance incidents

Available Resources

18 months of support conversations with outcomes: resolved/not resolved, reopen within 7 days, CSAT, refund issued, escalation, and retention at 30 days
Product help-center articles, internal support macros, policy docs, and troubleshooting guides
Event logs for agent actions and customer follow-up behavior
A baseline workflow without LLM assistance and a current pilot using retrieval + generation
Access to GPT-4.1-mini or Claude Sonnet-class models, embeddings, and a hybrid search index

Task

Define an evaluation framework that determines whether the LLM workflow improves customer success, not just agent efficiency. Specify primary metrics, guardrails, and how you would segment results by issue type and customer tier.
Design the offline evaluation suite first: golden set construction, hallucination and faithfulness checks, prompt-injection tests, and how you would calibrate any LLM-as-judge rubric against human labels.
Propose the online evaluation plan: experiment design, success criteria, unit of randomization, and how to handle confounders such as agent learning effects and issue-mix shifts.
Outline the workflow architecture and prompt strategy needed to support the evaluation goals, including grounded answering, citation requirements, and refusal behavior.
Estimate cost and latency, then explain what you would simplify or change if the system misses either budget while preserving customer-success gains.

Context

Constraints

p95 end-to-end latency: 2,500ms per assistant turn
Cost ceiling: $0.035 per assisted conversation turn, or $45K/month at 1.4M turns
Hallucination ceiling: <2% of responses on a labeled evaluation set may contain unsupported policy or product claims
Safety: must resist prompt injection from customer messages or retrieved docs, avoid leaking PII, and refuse when evidence is insufficient
Business goal: improve customer task success without increasing reopen rate, escalations, or compliance incidents

Available Resources

18 months of support conversations with outcomes: resolved/not resolved, reopen within 7 days, CSAT, refund issued, escalation, and retention at 30 days
Product help-center articles, internal support macros, policy docs, and troubleshooting guides
Event logs for agent actions and customer follow-up behavior
A baseline workflow without LLM assistance and a current pilot using retrieval + generation
Access to GPT-4.1-mini or Claude Sonnet-class models, embeddings, and a hybrid search index

Task

Define an evaluation framework that determines whether the LLM workflow improves customer success, not just agent efficiency. Specify primary metrics, guardrails, and how you would segment results by issue type and customer tier.
Design the offline evaluation suite first: golden set construction, hallucination and faithfulness checks, prompt-injection tests, and how you would calibrate any LLM-as-judge rubric against human labels.
Propose the online evaluation plan: experiment design, success criteria, unit of randomization, and how to handle confounders such as agent learning effects and issue-mix shifts.
Outline the workflow architecture and prompt strategy needed to support the evaluation goals, including grounded answering, citation requirements, and refusal behavior.
Estimate cost and latency, then explain what you would simplify or change if the system misses either budget while preserving customer-success gains.

Context

Constraints

p95 end-to-end latency: 2,500ms per assistant turn
Cost ceiling: $0.035 per assisted conversation turn, or $45K/month at 1.4M turns
Hallucination ceiling: <2% of responses on a labeled evaluation set may contain unsupported policy or product claims
Safety: must resist prompt injection from customer messages or retrieved docs, avoid leaking PII, and refuse when evidence is insufficient
Business goal: improve customer task success without increasing reopen rate, escalations, or compliance incidents

Available Resources

18 months of support conversations with outcomes: resolved/not resolved, reopen within 7 days, CSAT, refund issued, escalation, and retention at 30 days
Product help-center articles, internal support macros, policy docs, and troubleshooting guides
Event logs for agent actions and customer follow-up behavior
A baseline workflow without LLM assistance and a current pilot using retrieval + generation
Access to GPT-4.1-mini or Claude Sonnet-class models, embeddings, and a hybrid search index

Task

Define an evaluation framework that determines whether the LLM workflow improves customer success, not just agent efficiency. Specify primary metrics, guardrails, and how you would segment results by issue type and customer tier.
Design the offline evaluation suite first: golden set construction, hallucination and faithfulness checks, prompt-injection tests, and how you would calibrate any LLM-as-judge rubric against human labels.
Propose the online evaluation plan: experiment design, success criteria, unit of randomization, and how to handle confounders such as agent learning effects and issue-mix shifts.
Outline the workflow architecture and prompt strategy needed to support the evaluation goals, including grounded answering, citation requirements, and refusal behavior.
Estimate cost and latency, then explain what you would simplify or change if the system misses either budget while preserving customer-success gains.

Interview Guides

Context

Constraints

Available Resources

Task

Evaluate AI Support Workflow Impact

Context

Constraints

Available Resources

Task

Your Answer

Evaluate AI Support Workflow Impact

Context

Constraints

Available Resources

Task

Evaluate AI Support Workflow Impact

Context

Constraints

Available Resources

Task

Your Answer