Context
Streamly, a video streaming app, is testing a redesigned autoplay-next panel on the episode end screen. Product leadership wants to know how to act if the change improves the main conversion metric but harms a safety metric.
Hypothesis Seed
The new panel makes the next episode more prominent and is expected to increase the share of episode completions that lead to another episode start. However, the team worries it may also annoy users, increasing early-session exits and making the app feel pushy.
Constraints
- Eligible traffic: 240,000 episode-end impressions per day (impressions, not users; see the feasibility sketch after this list)
- Randomization must be at the user_id level to avoid a user seeing both designs in one binge session
- Maximum experiment duration: 14 days
- Allocation target: 50/50 after a 1-day 5% ramp for instrumentation checks
- False positives are costly because a bad rollout could hurt long-term retention; false negatives are acceptable if the effect is small
- The team will only ship if the gain is practically meaningful and no pre-registered guardrail is materially harmed
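Note that the traffic constraint is stated in impressions while randomization is by user, so feasibility depends on how many impressions each user generates. A back-of-envelope sketch, assuming a hypothetical rate of about 2 episode-end impressions per active user per day (the brief does not give this figure):

```python
# Convert impression traffic to user traffic for feasibility planning.
# ASSUMPTION: ~2 episode-end impressions per active user per day; replace
# with the observed rate from the instrumentation ramp.
daily_impressions = 240_000
impressions_per_user_per_day = 2.0

daily_users = daily_impressions / impressions_per_user_per_day
print(f"~{daily_users:,.0f} eligible users per day")  # ~120,000

# Unique users over 14 days is well below 14 * daily_users, since
# binge-watchers return on multiple days.
```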
Task
- Define the hypothesis, primary metric, guardrails, and a clear minimum detectable effect (MDE).
- Calculate the required sample size and determine whether the test can be completed within the 14-day limit using the available traffic (see the power-analysis sketch after this list).
- Choose the experiment design: unit of randomization, allocation, duration, and any stratification or blocking.
- Pre-register the analysis plan, including the statistical test, peeking policy, handling of multiple metrics, and how to handle user-level randomization with event-level outcomes (see the delta-method sketch after this list).
- State an explicit ship / don’t-ship / iterate rule for the case where the primary metric improves but a guardrail metric gets worse. Be specific about how large a guardrail decline is tolerable, if any, and how you would interpret statistical vs. practical significance (see the decision-rule sketch after this list).
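A minimal power-analysis sketch (Python, statsmodels) for the sample-size step. The baseline next-start rate and the MDE are not given above, so the 55% baseline and +2pp lift below are placeholders to be replaced with real numbers at pre-registration:

```python
# Sample size for a two-proportion z-test on the primary metric.
# BASELINE and MDE_ABS are ASSUMPTIONS, not figures from the brief.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

BASELINE = 0.55  # hypothetical completion -> next-episode start rate
MDE_ABS = 0.02   # hypothetical practically meaningful lift (+2pp absolute)
ALPHA = 0.05     # two-sided; the brief says false positives are costly
POWER = 0.80     # false negatives on small effects are acceptable

h = proportion_effectsize(BASELINE + MDE_ABS, BASELINE)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=h, alpha=ALPHA, power=POWER, ratio=1.0,
    alternative="two-sided",
)
print(f"~{n_per_arm:,.0f} users per arm")  # ~9,700 at these inputs

# Caveat: this treats one Bernoulli outcome per user (e.g., the first
# eligible impression). If all impressions per user are analyzed, inflate
# n by the design effect 1 + (m - 1) * rho, where m is impressions per
# user and rho is the intra-user correlation.
```

Under the earlier traffic assumption (~120,000 eligible users per day), a requirement on this order is reachable within the first day or two at full allocation, so the binding constraint on duration is covering at least one full weekly cycle, not power.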
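Because randomization is at the user_id level while the primary metric is a ratio of event-level counts (next-episode starts over episode completions), one standard approach is to aggregate to per-user sums and apply the delta method to the ratio of means. A sketch with hypothetical column names:

```python
# Delta-method variance for a ratio metric under user-level randomization.
# Column names (user_id, next_starts, completions) are hypothetical.
import numpy as np
import pandas as pd

def ratio_metric(df: pd.DataFrame, num_col: str, den_col: str):
    """Return (ratio, variance) for sum(num)/sum(den), clustered by user."""
    per_user = df.groupby("user_id")[[num_col, den_col]].sum()
    x = per_user[num_col].to_numpy(dtype=float)  # numerator per user
    y = per_user[den_col].to_numpy(dtype=float)  # denominator per user
    n = len(per_user)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cxy = np.cov(x, y, ddof=1)[0, 1]
    r = mx / my
    # Delta method: Var(Xbar/Ybar) ~= (vx - 2*r*cxy + r**2 * vy) / (n * my**2)
    var = (vx - 2 * r * cxy + r**2 * vy) / (n * my**2)
    return r, var

# A two-sample z statistic then compares arms:
# z = (r_t - r_c) / sqrt(var_t + var_c)
```

The same machinery covers the guardrails if they are also event-level ratios; a cluster-robust regression would be an equivalent alternative.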
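Finally, the ship / don’t-ship / iterate rule can be pre-registered unambiguously by encoding it. The thresholds below (a +2pp practical-significance bar for the primary, a -0.5pp non-inferiority margin for each guardrail) are illustrative placeholders, not numbers from the brief:

```python
# Hypothetical encoding of a pre-registered decision rule. All thresholds
# are placeholders; real margins belong in the pre-registration document.
from dataclasses import dataclass

@dataclass
class MetricResult:
    estimate: float  # treatment minus control, absolute difference
    ci_low: float    # lower bound of the multiplicity-adjusted CI
    ci_high: float   # upper bound

def decide(primary: MetricResult, guardrails: list[MetricResult]) -> str:
    MDE = 0.02                 # practical-significance bar for the primary
    GUARDRAIL_MARGIN = -0.005  # worst tolerable decline (higher-is-better)

    primary_wins = primary.ci_low > 0 and primary.estimate >= MDE
    # A guardrail "holds" only if a decline beyond the margin is ruled out.
    guardrails_hold = all(g.ci_low > GUARDRAIL_MARGIN for g in guardrails)
    # Material harm: the whole CI sits below the tolerable margin.
    material_harm = any(g.ci_high < GUARDRAIL_MARGIN for g in guardrails)

    if primary_wins and guardrails_hold:
        return "ship"
    if material_harm:
        return "don't ship"
    return "iterate"  # sub-MDE win or inconclusive guardrails
```

Under this framing, a statistically significant guardrail decline smaller than the margin would not block a ship, while an inconclusive guardrail (CI straddling the margin) sends the team back to iterate rather than ship.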