You work on an AI-assisted customer support product and want to test a new agent workflow that surfaces a more prominent suggested reply and next-best-action panel during live chats. The team believes the change will help agents resolve more conversations without escalation and reduce average handle time, but there is concern that faster resolutions could feel less personalized and lower customer satisfaction. You need an experiment design that can handle a likely lift in one metric and a drop in another.
How would you design this experiment so that you can make a clear ship or don’t-ship decision if the treatment improves one key outcome but harms another? Be explicit about your hypothesis, metric hierarchy, power and MDE, randomization choice, analysis plan, and the pitfalls you would guard against before interpreting the result.