
You are working on a generative AI application that uses an LLM to answer user questions and draft responses from internal context. The team needs a repeatable way to judge output quality before launch and after each prompt or model change.
How do you evaluate LLM outputs in a generative AI application?
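
One possible starting point is a small regression harness that reruns a fixed set of cases after every prompt or model change. The sketch below is illustrative only. The names `EvalCase`, `score_case`, `run_eval`, and `fake_generate`, the sample cases, and the 0.8 threshold are all assumptions made for this example, and the simple phrase checks stand in for whatever quality metric the team actually chooses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One golden test case (hypothetical structure for this sketch)."""
    question: str
    context: str
    must_include: List[str]  # phrases a good answer should contain
    must_avoid: List[str]    # phrases the context does not support


def score_case(answer: str, case: EvalCase) -> float:
    """Return the fraction of phrase checks the answer passes, in [0, 1]."""
    text = answer.lower()
    checks = [phrase.lower() in text for phrase in case.must_include]
    checks += [phrase.lower() not in text for phrase in case.must_avoid]
    return sum(checks) / len(checks) if checks else 1.0


def run_eval(
    generate: Callable[[str, str], str],
    cases: List[EvalCase],
    threshold: float = 0.8,  # assumed pass bar, tune per application
) -> dict:
    """Score every case with the same generator and summarize the run."""
    scores = [score_case(generate(c.question, c.context), c) for c in cases]
    mean = sum(scores) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}


def fake_generate(question: str, context: str) -> str:
    """Stand-in for the real LLM call; it simply echoes the context."""
    return f"Based on our records: {context}"


# Invented sample cases, not real application data.
CASES = [
    EvalCase(
        question="What is the refund window?",
        context="Refunds are accepted within 30 days of purchase.",
        must_include=["30 days"],
        must_avoid=["60 days"],
    ),
    EvalCase(
        question="Who approves travel requests?",
        context="Travel requests are approved by the line manager.",
        must_include=["line manager"],
        must_avoid=["finance team"],
    ),
]

report = run_eval(fake_generate, CASES)
print(report)
```

Swapping `fake_generate` for the real model call and keeping `CASES` fixed gives a score that can be compared before launch and after each change. Phrase matching is deliberately crude here, so it would likely need to be paired with human review or a richer scoring function for drafted responses.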