Context
FinFlow is launching an LLM-powered customer-support copilot that drafts answers to billing, refund, and account-access questions using help-center articles, policy docs, and prior resolved tickets. The product team wants to know whether the application is reliable enough to launch in agent-assist mode first, with limited customer-facing use later.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $12K/month at 400K requests/month
- Reliability bar: at least 92% answer correctness on in-scope questions
- Hallucination ceiling: <2% unsupported factual claims on a labeled eval set
- Safety: prompt-injection success rate <1%, no PII leakage in outputs, and high-confidence refusal when the answer is not grounded in retrieved context
- Answers must include citations to source documents for policy claims
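For quick back-of-envelope checks, the cost and latency ceilings above can be encoded directly. A minimal sketch: the threshold values come from the constraint list, while the function and variable names are illustrative, not part of any FinFlow codebase.

```python
# Budget and latency sanity check against the stated constraints.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 400_000
P95_LATENCY_BUDGET_MS = 2_500

# Implied per-request cost ceiling: $12K / 400K = $0.03.
COST_PER_REQUEST_CEILING = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS

def within_budget(cost_usd: float, p95_latency_ms: float) -> bool:
    """True if a measured configuration fits both the cost and latency ceilings."""
    return (cost_usd <= COST_PER_REQUEST_CEILING
            and p95_latency_ms <= P95_LATENCY_BUDGET_MS)

print(COST_PER_REQUEST_CEILING)                          # 0.03
print(within_budget(cost_usd=0.021, p95_latency_ms=1_900))  # True
```

The $0.03/request ceiling is the number any model or retrieval change must be evaluated against, not just the monthly total.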
Available Resources
- 15K help-center and policy documents, versioned with timestamps and access controls
- 200K historical support conversations with final agent resolution notes
- A baseline RAG pipeline with hybrid retrieval already running in staging
- Access to GPT-4.1-mini / GPT-4.1 class models, embeddings, and a small internal labeling budget (support QA team can label 800 examples)
- Event logs containing user query, retrieved docs, model output, latency, cost, and whether the human agent edited the draft
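The event-log fields in the last bullet can be modeled as a small record type for offline analysis. This is a sketch under assumptions: the field and function names are hypothetical, not FinFlow's actual schema, and `edit_rate` is only one cheap proxy signal derivable from these logs.

```python
from dataclasses import dataclass

@dataclass
class EventLogRecord:
    """One logged request, mirroring the fields listed above."""
    user_query: str
    retrieved_doc_ids: list[str]
    model_output: str
    latency_ms: float
    cost_usd: float
    agent_edited: bool  # whether the human agent edited the draft

def edit_rate(records: list[EventLogRecord]) -> float:
    """Fraction of drafts the agent edited — a coarse proxy for draft quality."""
    if not records:
        return 0.0
    return sum(r.agent_edited for r in records) / len(records)
```

Having the logs in a typed shape like this makes it easy to join them with eval labels later (e.g., correlating `agent_edited` with hallucination labels on the golden set).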
Task
- Design an evaluation framework to determine whether the LLM application is reliable for this customer-support use case. Define what “reliable” means and how you would measure it offline before launch.
- Propose an online evaluation and monitoring plan after launch, including primary success metrics, guardrails, and alerting thresholds.
- Explain how you would test and quantify hallucination, refusal quality, retrieval failures, and prompt injection risk. Include how you would build the golden set and how much human labeling you would use.
- Recommend whether FinFlow should ship to agent-assist only, limited customer-facing traffic, or not launch yet. Justify the decision using cost, latency, and risk trade-offs.
- Outline the minimal architecture or prompt changes you would make, deferring them until after the evaluation plan is defined.
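As one way to frame the go/no-go recommendation, the constraint thresholds can be expressed as an explicit launch gate. This is an illustrative sketch: the metric names, the gate structure, and the sample measurements are assumptions, while the threshold values come from the Constraints section.

```python
# Launch-gate sketch: compare measured offline-eval metrics to the
# thresholds stated in the Constraints section.
THRESHOLDS = {
    "answer_correctness": ("min", 0.92),      # at least 92% on in-scope questions
    "hallucination_rate": ("max", 0.02),      # <2% unsupported factual claims
    "injection_success_rate": ("max", 0.01),  # <1% prompt-injection success
}

def launch_gate(metrics: dict[str, float]) -> dict[str, bool]:
    """Return a per-metric pass/fail verdict against the stated thresholds."""
    results = {}
    for name, (direction, bound) in THRESHOLDS.items():
        value = metrics[name]
        results[name] = value >= bound if direction == "min" else value <= bound
    return results

# Hypothetical measurements, purely for illustration.
measured = {
    "answer_correctness": 0.94,
    "hallucination_rate": 0.015,
    "injection_success_rate": 0.02,
}
print(launch_gate(measured))
```

In this hypothetical run, correctness and hallucination pass but injection resistance fails, which would argue for an agent-assist-only launch until the injection metric clears its bar.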