Resolve Conflicting Metric Test Results

Scenario

You work on an AI-assisted customer support product and want to test a new agent workflow that surfaces a more prominent suggested reply and next-best-action panel during live chats. The team believes the change will help agents resolve more conversations without escalation and reduce average handle time, but there is concern that faster resolutions could feel less personalized and lower customer satisfaction. You need an experiment design that can handle a likely lift in one metric and a drop in another.

Constraints

Eligible traffic: 180,000 support conversations per day
Maximum experiment duration: 21 days
Only 60% of conversations are handled by agents eligible for the new workflow
Customer satisfaction cannot decline by more than 0.5 percentage points
Average handle time cannot worsen by more than 3%
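
The traffic constraints above fix the sample the experiment can reach, so it helps to work through the arithmetic before choosing an MDE. The sketch below is illustrative only. It assumes a 50/50 split, a two-sided alpha of 0.05, 80% power, and a hypothetical 70% baseline rate for resolution without escalation; none of these values come from the scenario. The `design_effect` parameter is also an assumption, included because randomizing by agent rather than by conversation would inflate the variance.

```python
from math import sqrt
from statistics import NormalDist

DAILY_CONVERSATIONS = 180_000  # from the constraints
ELIGIBLE_PERCENT = 60          # from the constraints
MAX_DAYS = 21                  # from the constraints


def eligible_sample(days: int = MAX_DAYS) -> int:
    """Conversations handled by eligible agents over the test window."""
    return DAILY_CONVERSATIONS * ELIGIBLE_PERCENT // 100 * days


def mde_proportion(baseline: float, n_per_arm: int, alpha: float = 0.05,
                   power: float = 0.80, design_effect: float = 1.0) -> float:
    """Approximate absolute MDE for a two-arm test on a proportion.

    Uses the normal approximation with equal arm sizes. design_effect > 1
    is a hypothetical adjustment for clustered (agent-level) assignment.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    se = sqrt(2 * baseline * (1 - baseline) * design_effect / n_per_arm)
    return (z_alpha + z_power) * se


total = eligible_sample()      # 2,268,000 conversations over 21 days
n_per_arm = total // 2         # assumed 50/50 split
ASSUMED_BASELINE = 0.70        # hypothetical, not given in the scenario
mde = mde_proportion(ASSUMED_BASELINE, n_per_arm)
print(f"eligible={total:,} per_arm={n_per_arm:,} mde={mde * 100:.2f} pp")
```

Under these assumptions the conversation-level MDE is well below the 0.5 percentage point guardrail margin. Agent-level randomization, or a CSAT metric that only a fraction of customers answer, would raise it, so the calculation should be repeated with the real baseline and unit of assignment.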

Question

How would you design this experiment so that you can make a clear ship or don’t-ship decision if the treatment improves one key outcome but harms another? Be explicit about your hypothesis, metric hierarchy, power and MDE, randomization choice, analysis plan, and the pitfalls you would guard against before interpreting the result.
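
One way to keep the ship decision unambiguous is to write the rule down before the test starts. The sketch below is a minimal example of such a pre-registered rule, not a prescribed answer. It assumes the primary metric is the lift in resolution without escalation and treats CSAT and average handle time as non-inferiority guardrails using the margins from the constraints. The `Interval` type and `decide` function are hypothetical names, and the intervals are assumed to be confidence intervals for treatment minus control.

```python
from dataclasses import dataclass

CSAT_MARGIN_PP = -0.5  # from the constraints: CSAT may drop at most 0.5 pp
AHT_MARGIN_PCT = 3.0   # from the constraints: AHT may rise at most 3%


@dataclass
class Interval:
    """Confidence interval for (treatment - control)."""
    low: float
    high: float


def decide(primary_lift_pp: Interval, csat_change_pp: Interval,
           aht_change_pct: Interval) -> str:
    """Return 'ship' only if guardrails hold and the primary metric wins."""
    # Non-inferiority: the whole interval must sit on the safe side
    # of each margin, not just the point estimate.
    csat_ok = csat_change_pp.low > CSAT_MARGIN_PP
    aht_ok = aht_change_pct.high < AHT_MARGIN_PCT
    if not (csat_ok and aht_ok):
        return "don't ship"
    # Superiority on the assumed primary metric.
    if primary_lift_pp.low > 0:
        return "ship"
    return "don't ship"


# Hypothetical results: primary improves, CSAT dips within the margin.
print(decide(Interval(0.4, 1.2), Interval(-0.3, 0.2), Interval(-4.0, -1.0)))
```

Because the guardrails are checked first, a win on the primary metric cannot override a CSAT or handle-time interval that crosses its margin, which is the conflicting-metrics case the question asks about.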
