Context
StreamSpace, a short-video app, is testing a new AI-generated “Daily Recap” carousel on the home feed. Product leadership wants to know whether any observed lift is durable or just a short-lived novelty effect from users interacting with something new.
Hypothesis Seed
The team believes the carousel will increase user engagement by helping viewers quickly find relevant content. However, they are concerned that early gains may be inflated because users click on new UI elements out of curiosity, then revert to baseline behavior after a few days.
Constraints
- Traffic: 1.2M daily active users (DAU)
- Only 60% of DAU land on the home feed and are eligible, i.e. roughly 720K eligible users/day
- Maximum experiment duration: 21 days
- Engineering wants a decision within 3 weeks to include the feature in the next release train
- A false positive is costly because the feature requires ongoing inference spend of about $180K/month
- A false negative is also costly because home-feed engagement is a top company KPI
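For orientation, the constraints above pin down a daily exposure ceiling. A minimal sketch of the arithmetic, assuming a 50/50 control/treatment split (the split itself is an assumption, not stated above):

```python
DAU = 1_200_000          # total daily active users
HOME_FEED_RATE = 0.60    # share of DAU who land on the home feed (eligible)
MAX_DAYS = 21            # hard cap on experiment duration

eligible_per_day = DAU * HOME_FEED_RATE   # 720,000 eligible users/day
per_arm_per_day = eligible_per_day / 2    # 360,000 users/day at a 50/50 split

print(f"Eligible users/day: {eligible_per_day:,.0f}")
print(f"Per-arm users/day (50/50 split): {per_arm_per_day:,.0f}")
# Note: unique users accumulate more slowly than 21 * 720K because the same
# users return on multiple days; treat 720K/day as a daily ceiling, not a
# cumulative unique-user count.
```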
Task
- Define clear null and alternative hypotheses, including how you will distinguish a true sustained lift from a novelty-driven spike.
- Choose the primary metric, 2-4 guardrail metrics, and at least one secondary metric. Specify the unit of randomization and unit of analysis, and justify both.
- Calculate the required sample size for the primary metric using an explicit MDE, then translate that into expected runtime given the available traffic (see the power-calculation sketch below).
- Propose a pre-registered analysis plan that addresses novelty effects, peeking, and multiple comparisons. Be explicit about whether you will analyze the full 21-day average, time-sliced effects, or both (a time-sliced example appears below).
- State a ship / don’t-ship / iterate decision rule that respects the guardrails, and explain what you would do if the feature shows a strong week-1 lift that fades by week 3.
Your answer should be concrete: use the numbers above, show the power calculation, and explain how you would avoid over-interpreting short-term excitement as long-term product value.
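For the power calculation, a minimal sketch using statsmodels' two-proportion power solver is below. The 25% baseline engagement rate and the +1% relative MDE are illustrative placeholders, not StreamSpace figures; substitute the team's actual baseline and the smallest lift that would justify the $180K/month inference spend.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative placeholders -- swap in the team's real baseline and MDE.
baseline = 0.25                      # assumed baseline home-feed engagement rate
mde_rel = 0.01                       # +1% relative lift as the minimum detectable effect
treated = baseline * (1 + mde_rel)   # 0.2525

effect = proportion_effectsize(treated, baseline)   # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)

eligible_per_day = 1_200_000 * 0.60   # 720K eligible users/day
days_needed = n_per_arm * 2 / eligible_per_day

print(f"n per arm: {n_per_arm:,.0f}")
print(f"Naive runtime at full eligible traffic: {days_needed:.1f} days")
# Caveat: days_needed assumes every day brings fresh unique users; with
# returning users, enrollment saturates and real runtime is longer.
```

With these placeholder numbers the experiment is traffic-rich, so the binding constraint is the 21-day window needed to separate sustained lift from novelty, not statistical power.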
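To make the novelty-effect requirement concrete, one common pre-registered approach is to estimate the treatment effect by days since a user's first exposure and test whether the late-window effect holds on its own. A minimal sketch, assuming a per-user-day log with hypothetical column names (`user_id`, `arm`, `days_since_exposure`, `engaged`) and a hypothetical file path:

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical schema: one row per eligible user per day on the home feed.
#   user_id, arm ("control"/"treatment"), days_since_exposure (0-20), engaged (0/1)
df = pd.read_parquet("recap_experiment.parquet")  # hypothetical path

# Time-sliced view: treatment-vs-control lift by days since first exposure.
daily = (df.groupby(["days_since_exposure", "arm"])["engaged"]
           .mean().unstack("arm"))
daily["lift"] = daily["treatment"] - daily["control"]
print(daily["lift"])   # a novelty spike shows up as a decaying lift curve

# Pre-registered "sustained lift" test: effect in the final week only (days 14-20).
late = df[df["days_since_exposure"] >= 14]
counts = late.groupby("arm")["engaged"].sum()
nobs = late.groupby("arm")["engaged"].count()
z, p = proportions_ztest(
    count=[counts["treatment"], counts["control"]],
    nobs=[nobs["treatment"], nobs["control"]],
)
print(f"Week-3 slice: z={z:.2f}, p={p:.4f}")
```

Pre-registering the week-3 slice as the confirmatory test (with the full 21-day average as a secondary estimate) is one way to keep a fading week-1 spike from driving the ship decision.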