

You maintain an LLM-powered feature and want to update its system prompt. You have a fixed set of 1,000 diverse user queries that represent real usage, and you need a reliable way to tell whether the new prompt is actually better or whether it regresses on important cases.
How would you design a test suite to evaluate whether a prompt change improved or degraded model performance across 1,000 diverse user queries?
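
One possible starting point, offered as a sketch and not a complete answer: because the query set is fixed, both prompts can be run on the same 1,000 queries and compared pairwise. The example below assumes each response has already been graded pass/fail by some grader (the question does not specify one), and the names `compare_prompts`, `old_pass`, and `new_pass` are hypothetical. It lists which queries were fixed or regressed and applies an exact two-sided sign test to the queries whose outcome changed.

```python
from math import comb


def compare_prompts(old_pass, new_pass):
    """Paired comparison of per-query pass/fail results for two prompts.

    old_pass and new_pass are equal-length lists of booleans, where
    position i holds the graded result for query i under each prompt.
    """
    if len(old_pass) != len(new_pass):
        raise ValueError("both runs must cover the same queries in the same order")

    pairs = list(zip(old_pass, new_pass))
    # Queries that passed with the old prompt but fail with the new one.
    regressions = [i for i, (old, new) in enumerate(pairs) if old and not new]
    # Queries that failed with the old prompt but pass with the new one.
    fixes = [i for i, (old, new) in enumerate(pairs) if new and not old]

    # Exact two-sided sign test on the changed queries only: under the
    # null hypothesis, a changed query is equally likely to be a fix or
    # a regression.
    n_changed = len(fixes) + len(regressions)
    if n_changed == 0:
        p_value = 1.0
    else:
        k = min(len(fixes), len(regressions))
        tail = sum(comb(n_changed, i) for i in range(k + 1)) / 2 ** n_changed
        p_value = min(1.0, 2 * tail)

    return {"fixes": fixes, "regressions": regressions, "p_value": p_value}


if __name__ == "__main__":
    # Toy data standing in for graded results on the real query set.
    old = [False] * 6 + [True] * 4
    new = [True] * 10
    report = compare_prompts(old, new)
    print(len(report["fixes"]), len(report["regressions"]), report["p_value"])
```

Returning the regression indices, and not only an aggregate score, keeps the important cases visible: a prompt can improve the overall pass rate while still breaking a handful of queries that matter, and those need to be reviewed individually.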