Describe a situation where you had to work across research, product, and infrastructure to define an LLM evaluation strategy and translate it into a production serving design. How did you align stakeholders who had different goals, such as quality, latency, cost, and safety, and how did you handle disagreements about what metrics mattered most? Please use a specific example and explain the result.