Context
ShopPilot is launching an AI shopping assistant on its ecommerce site. The feature answers product questions, summarizes reviews, and recommends next actions such as viewing a product page or adding an item to cart. Leadership wants a production-ready evaluation harness that measures answer quality and business impact before broad rollout.
Constraints
- p95 end-to-end evaluation latency for a single offline test case: under 3,000 ms
- p95 online serving latency for the feature itself: under 1,500 ms
- Evaluation cost ceiling: under $8,000/month at 200K evaluated production responses/month (about $0.04 per evaluated response)
- Hallucination rate: under 2% of responses on a labeled golden set
- Prompt injection success rate: under 0.5% on adversarial tests
- The harness must support both offline model iteration and online experiment readouts
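One way to make these constraints checkable is to encode them as thresholds and compare each evaluation run against them. The sketch below is illustrative only: the `Thresholds` and `RunSummary` classes, their field names, and the `violations` helper are assumptions, not part of any existing ShopPilot system. Only the numeric limits come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Limits copied from the Constraints list; field names are hypothetical.
    offline_case_p95_ms: float = 3000.0
    serving_p95_ms: float = 1500.0
    monthly_eval_budget_usd: float = 8000.0
    monthly_eval_volume: int = 200_000
    max_hallucination_rate: float = 0.02
    max_injection_success_rate: float = 0.005

    @property
    def cost_per_response_usd(self) -> float:
        # Budget implied per evaluated response: 8000 / 200000 = 0.04 USD.
        return self.monthly_eval_budget_usd / self.monthly_eval_volume


@dataclass(frozen=True)
class RunSummary:
    # Measured values for one evaluation run (hypothetical layout).
    offline_case_p95_ms: float
    serving_p95_ms: float
    cost_per_response_usd: float
    hallucination_rate: float
    injection_success_rate: float


def violations(run: RunSummary, t: Thresholds = Thresholds()) -> list[str]:
    """Return the name of every constraint the run fails.

    Every limit is treated as a strict "under" bound, so a value equal
    to the limit counts as a violation.
    """
    checks = {
        "offline_case_p95_ms": run.offline_case_p95_ms < t.offline_case_p95_ms,
        "serving_p95_ms": run.serving_p95_ms < t.serving_p95_ms,
        "cost_per_response_usd": run.cost_per_response_usd < t.cost_per_response_usd,
        "hallucination_rate": run.hallucination_rate < t.max_hallucination_rate,
        "injection_success_rate": run.injection_success_rate < t.max_injection_success_rate,
    }
    return [name for name, ok in checks.items() if not ok]
```

A run that returns an empty list passes every gate. Treating the limits as strict bounds mirrors the "under" wording of the constraints.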
Available Resources
- 50K historical user questions, product catalog records, reviews, and policy documents
- 1,200 human-labeled examples with rubrics for factuality, relevance, and actionability
- Event data for product-page clicks, add-to-cart actions, checkout starts, and purchases
- Access to an approved LLM API, embeddings API, and a vector store
- A small trust-and-safety team that can label 100 adversarial examples per month
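The label volumes above limit how tightly the rate ceilings in the Constraints section can be verified. As a rough illustration, the sketch below computes the exact one-sided upper confidence bound on a failure rate when zero failures are observed in n trials, which is 1 - alpha^(1/n). The function names are hypothetical, and the calculation assumes independent test cases drawn from the same distribution, which adversarial sets may not satisfy.

```python
import math


def zero_failure_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the true failure rate after 0 failures in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    # Solve (1 - p)^n = alpha for p.
    return 1.0 - alpha ** (1.0 / n)


def trials_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Smallest n for which a zero-failure run bounds the rate below max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be in (0, 1)")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_rate))


if __name__ == "__main__":
    # 100 adversarial labels per month versus the 0.5% injection ceiling.
    print(f"bound after 100 clean cases: {zero_failure_upper_bound(100):.4f}")
    print(f"clean cases needed for 0.5%: {trials_needed(0.005)}")
```

Under these assumptions, a clean run on one month of adversarial labels bounds the injection success rate at roughly 3%, not 0.5%. Demonstrating the 0.5% ceiling at 95% confidence would take about 600 failure-free cases, or around six months of labeling at the stated rate.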
Task
Design a complete evaluation harness for this AI feature.
- Define the offline evaluation framework for hallucination, relevance, and business conversion proxies, including golden sets, LLM-as-judge usage, calibration, and programmatic checks.
- Define the online evaluation plan, including the primary success metric, guardrails, experiment design, and how you would attribute downstream conversion impact.
- Propose the system architecture needed to run evaluations continuously on prompt, retrieval, and model changes, including how you would store traces, labels, and versioned results.
- Explain how the harness detects and reports prompt injection, unsupported claims, and regressions by segment (for example, long-tail products or policy questions).
- Estimate cost and latency for the harness, and describe what you would simplify first if the budget or latency target is missed.
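As one possible starting point for the segment-level regression reporting mentioned in the task, the sketch below compares per-segment failure rates between a baseline run and a candidate run. The record layout, the segment names, the `segment_regressions` function, and the 2 percentage point threshold are all hypothetical. The comparison is a plain rate difference and does not test statistical significance.

```python
from collections import defaultdict
from typing import Iterable


def failure_rates(rows: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each segment to its failure rate, given (segment, failed) records."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for segment, failed in rows:
        totals[segment] += 1
        fails[segment] += int(failed)
    return {segment: fails[segment] / totals[segment] for segment in totals}


def segment_regressions(
    baseline: Iterable[tuple[str, bool]],
    candidate: Iterable[tuple[str, bool]],
    min_delta: float = 0.02,  # hypothetical threshold: 2 percentage points
) -> dict[str, tuple[float, float]]:
    """Return {segment: (baseline_rate, candidate_rate)} for segments that got worse.

    Only segments present in both runs are compared.
    """
    base = failure_rates(baseline)
    cand = failure_rates(candidate)
    return {
        segment: (base[segment], cand[segment])
        for segment in sorted(base.keys() & cand.keys())
        if cand[segment] - base[segment] >= min_delta
    }


if __name__ == "__main__":
    baseline = [("long_tail", False)] * 9 + [("long_tail", True)] + [("policy", False)] * 10
    candidate = [("long_tail", False)] * 7 + [("long_tail", True)] * 3 + [("policy", False)] * 10
    print(segment_regressions(baseline, candidate))
```

Small segments produce noisy rates, so a fuller harness would likely pair this with a minimum sample size or a confidence interval per segment.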