Evaluate a Prompt Change

Scenario

You maintain an LLM-powered feature and want to update the system prompt. You have a fixed set of 1,000 diverse user queries that represent real usage, and you need a reliable way to tell whether the new prompt is actually better or whether it regresses on important cases.

Question

How would you design a test suite to evaluate whether a prompt change improved or degraded model performance across 1,000 diverse user queries?
