Design an LLM Evaluation Harness

Hard

Generative AI & LLMs

Context

ShopPilot is launching an AI shopping assistant on its ecommerce site. The feature answers product questions, summarizes reviews, and recommends next actions such as viewing a product page or adding an item to cart. Leadership wants a production-ready evaluation harness that measures answer quality and business impact before broad rollout.

Constraints

p95 end-to-end evaluation latency for a single offline test case: under 3,000 ms
p95 online serving latency budget for the feature itself: under 1,500 ms
Evaluation cost ceiling: under $8,000/month at 200K evaluated production responses/month
Hallucination ceiling: under 2% on a labeled golden set
Prompt injection success rate: under 0.5% on adversarial tests
The harness must support both offline model iteration and online experiment readouts

Available Resources

50K historical user questions, product catalog records, reviews, and policy documents
1,200 human-labeled examples with rubrics for factuality, relevance, and actionability
Event data for product-page click, add-to-cart, checkout start, and purchase conversion
Access to an approved LLM API, embeddings API, and a vector store
A small trust-and-safety team that can label 100 adversarial examples per month
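
The constraints and resources above imply a few hard numbers worth working out before designing anything. The sketch below is only illustrative arithmetic under stated assumptions: it treats the 1,200 human-labeled examples as the golden set, reads "under" as a strict inequality, and uses hypothetical helper names (`max_failures_under`, `min_cases_to_allow_one_failure`) that are not part of the prompt.

```python
import math
from fractions import Fraction

# Figures taken from the Constraints and Available Resources lists.
MONTHLY_BUDGET_USD = 8_000
MONTHLY_EVALUATED_RESPONSES = 200_000
HALLUCINATION_CEILING = Fraction(2, 100)    # under 2%
INJECTION_CEILING = Fraction(5, 1000)       # under 0.5%
ADVERSARIAL_LABELS_PER_MONTH = 100

# Assumption: all 1,200 human-labeled examples are used as the golden set.
GOLDEN_SET_SIZE = 1_200


def max_failures_under(n_cases: int, ceiling: Fraction) -> int:
    """Largest failure count k such that k / n_cases is strictly below ceiling."""
    return math.ceil(n_cases * ceiling) - 1


def min_cases_to_allow_one_failure(ceiling: Fraction) -> int:
    """Smallest test-set size n for which 1 / n is strictly below ceiling."""
    return math.floor(1 / ceiling) + 1


# Average evaluation spend available per production response.
budget_per_response_usd = Fraction(MONTHLY_BUDGET_USD, MONTHLY_EVALUATED_RESPONSES)

# How many golden-set hallucinations still satisfy the ceiling.
golden_failures_allowed = max_failures_under(GOLDEN_SET_SIZE, HALLUCINATION_CEILING)

# How many injection successes one month of adversarial labels can tolerate.
injection_failures_allowed = max_failures_under(
    ADVERSARIAL_LABELS_PER_MONTH, INJECTION_CEILING
)

# Adversarial set size (and months of labeling) needed before a single
# successful injection no longer breaches the ceiling by itself.
adversarial_cases_needed = min_cases_to_allow_one_failure(INJECTION_CEILING)
months_of_labeling = math.ceil(adversarial_cases_needed / ADVERSARIAL_LABELS_PER_MONTH)

if __name__ == "__main__":
    print(f"Budget per evaluated response: ${float(budget_per_response_usd):.2f}")
    print(f"Golden-set failures allowed: {golden_failures_allowed}")
    print(f"Injection successes allowed per 100 cases: {injection_failures_allowed}")
    print(f"Adversarial cases needed to tolerate one success: {adversarial_cases_needed}")
    print(f"Months of labeling at current capacity: {months_of_labeling}")
```

Under these assumptions the budget works out to $0.04 per evaluated response, at most 23 of the 1,200 golden examples may hallucinate, and a single month of 100 adversarial examples tolerates zero successful injections. The adversarial set would need at least 201 cases (about three months of labeling at the stated capacity) before one success stays under 0.5%, which suggests accumulating adversarial examples across months or supplementing them by other means.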

Task

Design a complete evaluation harness for this AI feature.

Define the offline evaluation framework for hallucination, relevance, and business conversion proxies, including golden sets, LLM-as-judge usage, calibration, and programmatic checks.
Define the online evaluation plan, including the primary success metric, guardrails, experiment design, and how you would attribute downstream conversion impact.
Propose the system architecture needed to run evaluations continuously on prompt, retrieval, and model changes, including how you would store traces, labels, and versioned results.
Explain how the harness detects and reports prompt injection, unsupported claims, and regressions by segment (for example, long-tail products or policy questions).
Estimate cost and latency for the harness, and describe what you would simplify first if the budget or latency target is missed.
