Context
FinFlow is launching an LLM-powered customer-support copilot that drafts answers to billing, refund, and account-access questions using help-center articles, policy docs, and prior resolved tickets. The product team wants to know whether the application is reliable enough to launch in agent-assist mode first, with limited customer-facing use later.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $12K/month at 400K requests/month
- Reliability bar: at least 92% answer correctness on in-scope questions
- Hallucination ceiling: <2% unsupported factual claims on a labeled eval set
- Safety: prompt-injection success rate <1%, no PII leakage in outputs, and high-confidence refusal when the answer is not grounded in retrieved context
- Answers must include citations to source documents for policy claims
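For quick back-of-envelope checks, the cost and latency ceilings above can be encoded directly. A minimal sketch: the threshold values come from the constraint list, while the function and variable names are illustrative, not part of any FinFlow codebase.

```python
# Budget and latency sanity check against the stated constraints.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 400_000
P95_LATENCY_BUDGET_MS = 2_500

# Implied per-request cost ceiling: $12K / 400K = $0.03.
COST_PER_REQUEST_CEILING = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS

def within_budget(cost_usd: float, p95_latency_ms: float) -> bool:
    """True if a measured configuration fits both the cost and latency ceilings."""
    return (cost_usd <= COST_PER_REQUEST_CEILING
            and p95_latency_ms <= P95_LATENCY_BUDGET_MS)

print(COST_PER_REQUEST_CEILING)                          # 0.03
print(within_budget(cost_usd=0.021, p95_latency_ms=1_900))  # True
```

The $0.03/request ceiling is the number any model or retrieval change must be evaluated against, not just the monthly total.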
Available Resources
- 15K help-center and policy documents, versioned with timestamps and access controls
- 200K historical support conversations with final agent resolution notes
- A baseline RAG pipeline with hybrid retrieval already running in staging
- Access to GPT-4.1-mini / GPT-4.1 class models, embeddings, and a small internal labeling budget (support QA team can label 800 examples)
- Event logs containing user query, retrieved docs, model output, latency, cost, and whether the human agent edited the draft
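The event-log fields in the last bullet can be modeled as a small record type for offline analysis. This is a sketch under assumptions: the field and function names are hypothetical, not FinFlow's actual schema, and `edit_rate` is only one cheap proxy signal derivable from these logs.

```python
from dataclasses import dataclass

@dataclass
class EventLogRecord:
    """One logged request, mirroring the fields listed above."""
    user_query: str
    retrieved_doc_ids: list[str]
    model_output: str
    latency_ms: float
    cost_usd: float
    agent_edited: bool  # whether the human agent edited the draft

def edit_rate(records: list[EventLogRecord]) -> float:
    """Fraction of drafts the agent edited — a coarse proxy for draft quality."""
    if not records:
        return 0.0
    return sum(r.agent_edited for r in records) / len(records)
```

Having the logs in a typed shape like this makes it easy to join them with eval labels later (e.g., correlating `agent_edited` with hallucination labels on the golden set).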
Task
- Design an evaluation framework to determine whether the LLM application is reliable for this customer-support use case. Define what “reliable” means and how you would measure it offline before launch.
- Propose an online evaluation and monitoring plan after launch, including primary success metrics, guardrails, and alerting thresholds.
- Explain how you would test and quantify hallucination, refusal quality, retrieval failures, and prompt injection risk. Include how you would build the golden set and how much human labeling you would use.
- Recommend whether FinFlow should ship to agent-assist only, limited customer-facing traffic, or not launch yet. Justify the decision using cost, latency, and risk trade-offs.
- Outline the minimal architecture or prompt changes you would make, deferring them until after the evaluation plan is defined.
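As one way to frame the go/no-go recommendation, the constraint thresholds can be expressed as an explicit launch gate. This is an illustrative sketch: the metric names, the gate structure, and the sample measurements are assumptions, while the threshold values come from the Constraints section.

```python
# Launch-gate sketch: compare measured offline-eval metrics to the
# thresholds stated in the Constraints section.
THRESHOLDS = {
    "answer_correctness": ("min", 0.92),      # at least 92% on in-scope questions
    "hallucination_rate": ("max", 0.02),      # <2% unsupported factual claims
    "injection_success_rate": ("max", 0.01),  # <1% prompt-injection success
}

def launch_gate(metrics: dict[str, float]) -> dict[str, bool]:
    """Return a per-metric pass/fail verdict against the stated thresholds."""
    results = {}
    for name, (direction, bound) in THRESHOLDS.items():
        value = metrics[name]
        results[name] = value >= bound if direction == "min" else value <= bound
    return results

# Hypothetical measurements, purely for illustration.
measured = {
    "answer_correctness": 0.94,
    "hallucination_rate": 0.015,
    "injection_success_rate": 0.02,
}
print(launch_gate(measured))
```

In this hypothetical run, correctness and hallucination pass but injection resistance fails, which would argue for an agent-assist-only launch until the injection metric clears its bar.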