Context
FitLoop, a consumer fitness app, wants to launch a new growth feature: a 7-day “streak challenge” card shown on the home screen to encourage users to invite friends and return daily. Product leadership is worried that any early lift may be driven by novelty rather than durable behavior change.
Hypothesis Seed
The team believes the streak challenge will increase 7-day retention and referral conversion by making progress more visible and socially shareable. However, they suspect the effect may spike in the first few days after exposure and then decay, so a naive read of aggregate lift could overstate long-term impact.
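To see why a pooled read could overstate long-term impact, here is a minimal sketch. It assumes, purely for illustration, that lift is a durable floor plus a novelty bump that decays exponentially with days since first exposure. The function names and every number below are hypothetical, not FitLoop data.

```python
def daily_lift(day, durable_pp, novelty_pp, half_life_days):
    """Assumed lift (percentage points) on a given day since first exposure."""
    return durable_pp + novelty_pp * 0.5 ** (day / half_life_days)


def mean_lift(start_day, end_day, durable_pp, novelty_pp, half_life_days):
    """Average assumed lift over days start_day .. end_day - 1."""
    days = range(start_day, end_day)
    total = sum(daily_lift(d, durable_pp, novelty_pp, half_life_days) for d in days)
    return total / len(days)


# Hypothetical parameters: 0.5 pp durable lift, 2.0 pp novelty bump, 3-day half-life.
params = dict(durable_pp=0.5, novelty_pp=2.0, half_life_days=3.0)
pooled = mean_lift(0, 21, **params)   # naive read over the full 21-day window
late = mean_lift(14, 21, **params)    # read restricted to the final week
print(f"pooled={pooled:.2f} pp, final week={late:.2f} pp, "
      f"durable={params['durable_pp']:.2f} pp")
```

Under these assumptions the pooled average sits above the final-week average, which in turn sits closer to the durable floor. That gap is the overstatement the team is worried about.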
Constraints
- Eligible traffic: 240,000 active users per day
- Maximum experiment window: 21 days, after which the team must decide whether to ship broadly
- 70% of eligible users are on mobile, 30% on web
- A false positive is costly because the feature adds engineering and notification complexity; a false negative is only moderately costly because the feature can be revisited next quarter
- The team can tolerate at most a 0.5 percentage point drop in day-1 activation and a 1.0 percentage point increase in app uninstall rate
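The traffic and window constraints above can be turned into a rough budget. The sketch below assumes the primary metric is 7-day retention, so each user needs 7 days of follow-up inside the 21-day window, and it assumes a 50/50 split. Both are assumptions, as are the helper names. Because "active users per day" counts returning users repeatedly, the result is an upper bound on unique users, not an estimate.

```python
DAILY_ELIGIBLE = 240_000      # from the constraints
WINDOW_DAYS = 21              # from the constraints
MOBILE_PCT = 70               # from the constraints
RETENTION_HORIZON_DAYS = 7    # assumption: primary metric is 7-day retention


def enrollment_days(window_days, horizon_days):
    """Days on which users can enter and still be observed for the full horizon."""
    return max(window_days - horizon_days, 0)


def user_day_ceiling(daily_eligible, days):
    """Upper bound on enrolled users; unique users will be fewer because
    the same active users show up on multiple days."""
    return daily_eligible * days


enroll = enrollment_days(WINDOW_DAYS, RETENTION_HORIZON_DAYS)
ceiling = user_day_ceiling(DAILY_ELIGIBLE, enroll)
per_arm_ceiling = ceiling // 2                      # assumption: 50/50 split
mobile_per_day = DAILY_ELIGIBLE * MOBILE_PCT // 100
web_per_day = DAILY_ELIGIBLE - mobile_per_day

print(f"enrollment days: {enroll}")
print(f"user-day ceiling: {ceiling:,} ({per_arm_ceiling:,} per arm)")
print(f"daily eligible by platform: mobile={mobile_per_day:,}, web={web_per_day:,}")
```

With the stated constraints this gives 14 enrollment days and a ceiling of 3,360,000 user-days. The true unique-user count depends on how often active users return, which the brief does not state.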
Tasks
- Define a clear hypothesis, including how you would distinguish a novelty effect from a durable treatment effect.
- Choose the primary metric, two to four guardrail metrics, and a minimum detectable effect (MDE), and justify the unit of randomization.
- Calculate the required sample size and determine whether the test can be completed within 21 days given the available traffic.
- Pre-register an analysis plan covering the statistical test, how you will analyze treatment effect over time, peeking policy, and how you will handle multiple comparisons.
- State a ship / don’t-ship / iterate rule that explicitly accounts for novelty, guardrails, and any data quality issues such as sample ratio mismatch.
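As a starting point for the sample-size and data-quality tasks above, here is a minimal sketch using only the Python standard library. It uses the normal-approximation formula for a two-sided, two-proportion z-test and a chi-square goodness-of-fit check for sample ratio mismatch. The 30% baseline retention, the 1.0 percentage point MDE, the alpha of 0.01 (chosen here because a false positive is described as costly), and the 80% power are all hypothetical placeholders, not values given in the brief.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, mde_abs, alpha=0.01, power=0.80):
    """Per-arm sample size for a two-sided, two-proportion z-test
    (normal approximation, equal allocation)."""
    p_treat = p_control + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the
    observed assignment split versus the planned split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical inputs: 30% baseline 7-day retention, 1.0 pp absolute MDE.
needed = n_per_arm(p_control=0.30, mde_abs=0.01)
print(f"users needed per arm: {needed:,}")

# Hypothetical assignment counts, used only to exercise the SRM check.
print(f"SRM p-value for a 50,000 / 50,900 split: {srm_p_value(50_000, 50_900):.4f}")
```

Replace the placeholder baseline and MDE with measured values before drawing conclusions. If the randomization unit or the metric changes (for example, a ratio metric or clustered assignment), this formula no longer applies as written.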