

You are working on an LLM-powered product feature and need a clear way to judge whether the model is good enough to ship and whether it is improving over time. Because the outputs are open-ended, simple accuracy is not enough.
How do you evaluate the performance of a generative model?