Context
Streamly, a short-video app, is testing a redesigned home feed that surfaces more personalized creator recommendations. The product team wants to know not just whether the variant wins, but whether the experiment itself can be trusted.
Hypothesis Seed
The new feed ranking is expected to increase the rate at which users watch at least one video for 30+ seconds after opening the app, because the recommendations should feel more relevant. However, the change could also hurt ad revenue or session stability, and the results could be misleading if the test is analyzed poorly.
Constraints
- Eligible traffic: 600,000 daily active users per day
- Maximum experiment duration: 14 days
- Allocation target: 50/50 after a 1-day 5% ramp for instrumentation checks
- Baseline 30-second watch-start conversion: 40%
- Smallest business-relevant lift: 2% relative (40% to 40.8%, i.e. +0.8 percentage points absolute)
- False positives are costly because a bad ranking launch can reduce creator exposure and ad revenue; false negatives are acceptable because the team can rerun the test later
Deliverables
- State the null and alternative hypotheses, define the primary metric, and choose 2-4 guardrail metrics with thresholds (a minimal formulation follows this list).
- Choose and justify alpha and power (the cost asymmetry above argues for a conservative alpha), compute the required sample size per arm from the baseline and MDE, and translate it into expected runtime under the traffic constraint (see the sample-size sketch below).
- Choose the unit of randomization and explain how you would analyze the test if the metric is measured at a different level than the randomization unit (see the clustered-analysis sketch below).
- Pre-register an analysis plan covering the statistical test, the peeking policy, the multiple-comparisons treatment, and a ship / don't-ship rule (see the alpha-spending sketch below).
- Identify the main pitfalls to watch for when analyzing this A/B test, covering at minimum peeking, novelty effects, network interference (SUTVA violations), and sample ratio mismatch, and explain how you would mitigate each (see the SRM check below).
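For the first deliverable, a minimal hypothesis formulation, writing p_C and p_V for the control and variant rates of the primary metric (share of users with at least one 30-second watch-start after opening the app); a two-sided alternative is the safer pre-registration default even though the seed is directional:

```latex
H_0 : p_V = p_C \qquad H_1 : p_V \neq p_C
```

Guardrails consistent with the hypothesis seed would be ad revenue per user, crash or session-abandon rate, and creator-exposure share, each with a pre-registered non-inferiority margin; the specific metrics and margins are suggestions, not part of the brief.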
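A minimal sample-size sketch for the second deliverable. The alpha and power values are assumptions, not given in the brief: a strict alpha = 0.01 reflects the stated asymmetry that false positives are costlier, and power = 0.80 is a standard choice given that false negatives are tolerable:

```python
from scipy.stats import norm

# Stated constraints
baseline = 0.40                       # control 30s watch-start rate
mde_rel = 0.02                        # smallest business-relevant relative lift
variant = baseline * (1 + mde_rel)    # 0.408, i.e. +0.8 percentage points

# Assumed error rates (not in the brief): strict alpha because false
# positives are costly; 80% power since the test can be rerun.
alpha, power = 0.01, 0.80

z_a = norm.ppf(1 - alpha / 2)         # two-sided critical value
z_b = norm.ppf(power)

# Two-proportion z-test sample size per arm (unpooled variances)
var_sum = baseline * (1 - baseline) + variant * (1 - variant)
n_per_arm = (z_a + z_b) ** 2 * var_sum / (variant - baseline) ** 2
print(f"n per arm ~ {n_per_arm:,.0f}")       # ~88,000 under these assumptions

# Runtime under the traffic constraint: 600k eligible DAU at 50/50.
daily_per_arm = 600_000 * 0.5
print(f"days to fill ~ {n_per_arm / daily_per_arm:.2f}")  # well under 1 day
```

Even at this strict alpha the required sample fills in under a day of traffic, so the binding constraint is not power but coverage: running one to two full weeks anyway averages over day-of-week and novelty effects.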
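For the randomization-unit deliverable: randomizing by user is the natural choice here, but the metric may be logged per session (per app open), and session rows within a user are correlated. A sketch of one standard remedy, clustering standard errors on the randomization unit; the data below is synthetic and the column names are illustrative, not from the brief:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Synthetic data: users are randomized, but each user contributes
# several sessions, so session-level rows are not independent.
n_users = 2_000
users = pd.DataFrame({
    "user_id": np.arange(n_users),
    "treated": rng.integers(0, 2, n_users),
    "propensity": rng.beta(4, 6, n_users),  # per-user base rate -> correlation
})
n_sessions = rng.integers(1, 6, n_users)    # 1-5 app opens per user
sessions = users.loc[users.index.repeat(n_sessions)].reset_index(drop=True)
sessions["watch30"] = rng.binomial(
    1, np.clip(sessions["propensity"] + 0.01 * sessions["treated"], 0, 1)
)

# Linear probability model with standard errors clustered on user_id,
# the unit of randomization: the point estimate stays an interpretable
# rate difference while inference respects within-user correlation.
fit = smf.ols("watch30 ~ treated", data=sessions).fit(
    cov_type="cluster", cov_kwds={"groups": sessions["user_id"]}
)
print(fit.summary().tables[1])
```

An often simpler alternative is to collapse to one row per user (any 30s+ watch-start, yes or no) before testing, which matches the primary metric definition directly and sidesteps the clustering problem.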
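For the pre-registration deliverable, the final test can be an ordinary two-proportion z-test; the interesting part of the plan is the peeking policy. One common choice is a group-sequential design with a Lan-DeMets O'Brien-Fleming-style alpha-spending schedule, which allocates almost none of the alpha budget to early looks. A sketch, with placeholder counts and the assumed alpha from the sample-size sketch:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

# Final analysis as pre-registered: two-sided two-proportion z-test on
# the primary metric. Counts are placeholders, not real results.
successes = np.array([35_500, 36_200])   # 30s watch-starts: control, variant
totals = np.array([88_000, 88_000])      # users per arm
z_stat, p_value = proportions_ztest(successes, totals)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Peeking policy: cumulative alpha allowed at information fraction t
# under the Lan-DeMets O'Brien-Fleming-style spending function
#   alpha*(t) = 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t)))
# Converting this schedule into interim critical values requires the
# joint distribution of the sequential z-statistics (group-sequential
# software); the point here is how little alpha early looks receive.
alpha = 0.01  # assumed, matching the sample-size sketch
for t in (0.25, 0.50, 0.75, 1.00):
    spend = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))
    print(f"information {t:.0%}: cumulative alpha spent ~ {spend:.6f}")
```

Since there is a single primary metric, multiple-comparisons treatment mainly concerns the guardrails and any segment analyses, which can be pre-registered as non-inferiority checks or explicitly labeled exploratory.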
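For the pitfalls deliverable, sample ratio mismatch is the easiest one to automate: under a healthy 50/50 allocation the treatment count is binomial, so a goodness-of-fit test flags broken assignment or logging. A sketch with placeholder counts:

```python
from scipy.stats import binomtest

# Observed arm sizes (placeholders). Under a healthy 50/50 split the
# treatment count is Binomial(n_total, p=0.5).
n_control, n_treatment = 301_000, 299_000
result = binomtest(n_treatment, n_control + n_treatment, p=0.5)

# A very strict threshold (e.g. p < 0.001) is conventional, because the
# check runs repeatedly and a genuine SRM usually yields a tiny p-value.
print(f"SRM p-value: {result.pvalue:.4g}")
if result.pvalue < 0.001:
    print("SRM detected: halt the analysis and audit assignment and logging.")
```

The other pitfalls are procedural rather than computational: peeking is handled by the alpha-spending plan above, novelty effects by running at least one full week and comparing early versus late treatment effects, and interference by checking whether creator-side exposure couples the arms (a shared creator pool can violate SUTVA even when users are randomized independently).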