Context
ConnectHub, a social networking app, wants to test a new home-feed ranking algorithm that prioritizes “meaningful interactions” over recency. The team needs a launch decision quickly because the current model is underperforming on engagement.
Hypothesis Seed
The new ranker is expected to increase the 7-day engaged-user rate by surfacing more relevant posts, but it may also reduce ad revenue or create network interference: because users interact with content produced by other users, one user's treatment can change what another user sees. This experiment is meant to assess not just lift, but whether the design is robust to common online-experimentation pitfalls.
Constraints
- Eligible traffic: 1.2M daily active users (DAU)
- Maximum experiment duration: 14 days
- 50/50 allocation after a 1-day 5% ramp for instrumentation checks
- Baseline 7-day engaged-user rate: 32%
- Smallest business-relevant lift: 2% relative, i.e., 32% → 32.64%, a 0.64-percentage-point absolute change (see the sample-size sketch after this list)
- False positives are costly because shipping a bad ranker affects the entire feed ecosystem and ad revenue
- False negatives are acceptable up to a point; the team can iterate next sprint if the result is inconclusive
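For the feasibility question in the deliverables, here is a minimal sample-size sketch using the constraint values above. It assumes a two-sided two-proportion z-test at α = 0.05 with 80% power via statsmodels; the α and power choices are assumptions, not part of the brief, and the asymmetric error costs above might argue for a stricter α.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.32                # current 7-day engaged-user rate
treated = baseline * 1.02      # 2% relative lift -> 0.3264

# Cohen's h (arcsine-transformed difference of two proportions)
h = proportion_effectsize(treated, baseline)

# Users required per arm; alpha and power are assumed, not given in the brief
n_per_arm = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"required users per arm: {n_per_arm:,.0f}")  # roughly 84,000 here
```

Under these assumptions, roughly 84k users per arm (~168k total) against 1.2M eligible DAU means enrollment volume is not the binding constraint; the 7-day engagement window is, since the last enrolled cohort must still be fully observable within the 14-day cap.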
Deliverables
- Define the null and alternative hypotheses, the primary metric, 2-4 guardrail metrics, and an explicit minimum detectable effect (MDE).
- Calculate the required sample size and determine whether the experiment can be completed within 14 days given the available traffic.
- Choose the unit of randomization and explain how you would handle likely pitfalls such as peeking, novelty effects, network interference (violations of the stable unit treatment value assumption, SUTVA), and sample ratio mismatch (SRM); sketches illustrating the SRM check and the cost of peeking appear at the end of this brief.
- Pre-register an analysis plan: statistical test, treatment of secondary metrics, multiple-comparison policy, and when results will be read.
- State a clear ship / don’t-ship / iterate rule that respects both the primary metric and guardrails.
Your answer should be concrete: use the numbers above, show the sample-size math, and explain which pitfalls are most dangerous in this specific experiment rather than listing generic issues.
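As a concrete anchor for the SRM pitfall and the pre-registered primary test, here is a minimal read-out sketch using scipy and statsmodels. The counts are illustrative placeholders, not results, and the p < 0.001 SRM threshold is a common industry convention rather than a rule from the brief.

```python
from scipy.stats import chisquare
from statsmodels.stats.proportion import proportions_ztest

# Illustrative placeholder counts -- not real experiment results
n_ctrl, n_treat = 84_412, 84_201       # users enrolled per arm
x_ctrl, x_treat = 27_012, 27_549       # 7-day engaged users per arm

# 1) Sample ratio mismatch: chi-square of observed counts against the
#    planned 50/50 split. A very small p-value (a common convention is
#    p < 0.001) means the assignment pipeline is broken; debug before
#    reading any metric.
srm_stat, srm_p = chisquare(f_obs=[n_ctrl, n_treat])
print(f"SRM p-value: {srm_p:.4f}")

# 2) Primary metric: two-sided two-proportion z-test on the 7-day
#    engaged-user rate, read once at the pre-registered end date.
z, p = proportions_ztest(count=[x_treat, x_ctrl], nobs=[n_treat, n_ctrl],
                         alternative="two-sided")
lift = x_treat / n_treat - x_ctrl / n_ctrl
print(f"absolute lift: {lift:+.4f}, z = {z:.2f}, p = {p:.4f}")
```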
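On the peeking pitfall specifically, a short A/A simulation (numpy and scipy; every parameter here is illustrative) makes the danger concrete: testing at a nominal α = 0.05 after each day of a 13-day read inflates the chance of at least one false positive well above 5%, which is why the plan must fix the read date or adopt a sequential procedure in advance.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_base = 0.32        # both arms at the baseline rate: a true A/A test
n_per_day = 2_000    # illustrative daily enrollment per arm
days = 13            # 14-day cap minus the 1-day ramp
n_sims = 1_000
z_crit = norm.ppf(0.975)   # nominal two-sided alpha = 0.05

ever_significant = 0
for _ in range(n_sims):
    # Cumulative engaged-user counts per arm after each day
    ca = rng.binomial(n_per_day, p_base, size=days).cumsum()
    cb = rng.binomial(n_per_day, p_base, size=days).cumsum()
    for d in range(days):
        n = (d + 1) * n_per_day
        pooled = (ca[d] + cb[d]) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        # Two-proportion z-statistic at this interim look
        if abs(cb[d] - ca[d]) / (n * se) > z_crit:
            ever_significant += 1
            break

# With ~13 looks this typically lands well above the nominal 0.05
print(f"false-positive rate under daily peeking: {ever_significant / n_sims:.3f}")
```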