You fine-tuned a model for a domain-specific LLM feature, and now you need to decide whether it is actually better than the base model. Offline spot checks look promising, but you want a defensible evaluation plan before rollout.
How would you evaluate whether the fine-tuned model is better than the base model?