Context
FinPilot is building an LLM-powered assistant that answers customer questions about credit-card benefits, fees, disputes, and account policies inside its mobile app. The team has a working prototype, but leadership will only ship if the feature is demonstrably useful and safe.
Constraints
- p95 end-to-end latency: at most 2,500 ms
- Cost ceiling: $0.03 per request and $45K/month at 1.5M requests/month
- Usefulness bar: at least 75% of evaluated answers are rated helpful on a representative test set
- Safety bar: hallucination rate below 2%, prompt-injection success rate below 0.5%, and unsafe financial guidance rate below 0.1%
- The assistant must refuse when policy information is missing, ambiguous, or unsupported
- Responses must not leak PII or internal-only policy text
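Because the safety bars are low rates (2%, 0.5%, 0.1%) measured on a finite evaluation set, a pass/fail gate should account for sampling error rather than compare point estimates directly. A minimal sketch, assuming the 1K labeled set described under Available Resources; `wilson_upper` and `passes_bar` are illustrative helper names, not part of any prescribed tooling:

```python
import math


def wilson_upper(failures: int, n: int, z: float = 1.96) -> float:
    """Upper bound of the 95% Wilson score interval for a failure rate."""
    if n == 0:
        return 1.0
    p = failures / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / denom


def passes_bar(failures: int, n: int, threshold: float) -> bool:
    """Pass only if the upper confidence bound is below the threshold,
    so a lucky sample cannot clear a bar the true rate would miss."""
    return wilson_upper(failures, n) < threshold


# With 1,000 examples, zero observed hallucinations gives an upper
# bound near 0.4%, comfortably under the 2% bar; 15 observed failures
# (a 1.5% point estimate) would not clear it.
```

One implication worth noting: at n = 1,000 the 0.1% unsafe-guidance bar cannot be demonstrated offline even with zero observed failures, which argues for a larger adversarial slice or an online guardrail for that metric.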
Available Resources
- 40K internal policy documents, help-center articles, and compliance FAQs
- 8K historical customer-support chats with human resolutions
- 1K manually labeled evaluation examples across common, ambiguous, and adversarial queries
- Access to an approved GPT-4-class model and a cheaper small model for routing or grading
- Product analytics events: follow-up question rate, escalation to human support, thumbs up/down
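Since the 1K labeled examples span common, ambiguous, and adversarial queries, offline runs should sample each stratum explicitly so rare adversarial cases are not drowned out by common ones. A minimal sketch, assuming each example carries a `stratum` field; the field name, counts, and `stratified_sample` helper are illustrative:

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_stratum, seed=0):
    """Draw a fixed number of examples from each stratum so every
    query category is represented in an evaluation batch."""
    rng = random.Random(seed)  # fixed seed keeps eval runs reproducible
    by_stratum = defaultdict(list)
    for ex in examples:
        by_stratum[ex["stratum"]].append(ex)
    sample = []
    for _, items in sorted(by_stratum.items()):
        k = min(per_stratum, len(items))
        sample.extend(rng.sample(items, k))
    return sample


# Illustrative pool: 1,000 labeled examples, skewed toward "common".
pool = (
    [{"id": i, "stratum": "common"} for i in range(700)]
    + [{"id": i, "stratum": "ambiguous"} for i in range(700, 900)]
    + [{"id": i, "stratum": "adversarial"} for i in range(900, 1000)]
)
batch = stratified_sample(pool, per_stratum=100)  # 300 examples, 100 per stratum
```

Reporting metrics per stratum, not just in aggregate, also surfaces failures that a blended helpfulness number would hide.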
Task
- Define a ship-readiness evaluation framework that determines whether the feature is safe and useful enough to launch, including explicit offline metrics, thresholds, and pass/fail criteria.
- Design the online evaluation plan for a limited rollout, including success metrics, guardrails, segmentation, and rollback triggers.
- Propose the minimum architecture and prompting approach needed to meet the evaluation bar, including how the system should refuse, cite evidence, and defend against prompt injection.
- Explain how you would calibrate an LLM-as-judge grader, where human review is required, and how you would monitor regressions after launch.
- Estimate cost and latency for the evaluation and serving approach, and describe the main tradeoffs you would make if the system misses either budget.
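For the judge-calibration task above, one common check is chance-corrected agreement between judge labels and human labels on a shared slice. A minimal sketch, assuming binary helpful/not-helpful labels; the 0.7 gate is an illustrative choice, not a prescribed value:

```python
def cohens_kappa(human, judge):
    """Cohen's kappa: agreement between human and LLM-judge labels,
    corrected for the agreement expected by chance alone.
    Labels are binary: 1 = helpful, 0 = not helpful."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h = sum(human) / n  # human positive rate
    p_j = sum(judge) / n  # judge positive rate
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Gate: trust the judge for offline scoring only if kappa on a
# human-labeled calibration slice clears an agreed bar (e.g. 0.7);
# otherwise route more examples to human review.
```

Raw percent agreement alone would overstate judge quality whenever one label dominates, which is exactly the situation for low-rate safety metrics; kappa avoids that failure mode.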