Context
PulseChat, a consumer messaging app, is testing a new feature called Quick Reactions that lets users respond to messages with one-tap animated reactions. Product leadership wants to know whether the feature creates durable engagement or just a short-lived novelty spike.
Hypothesis Seed
The team believes Quick Reactions will increase meaningful conversation engagement by making lightweight responses easier. However, they are explicitly worried that users will overuse the feature in its first few days simply because it is new and visually salient, with usage falling back once the novelty wears off.
Constraints
- Total traffic: 240,000 daily active users (DAU)
- 80% of DAU are eligible for the experiment after app-version filtering (192,000 eligible users per day)
- Maximum decision window: 21 days
- Randomization must be user-level because the feature is persistent in the UI
- A false positive is costly: shipping a novelty-only feature adds long-term UI clutter and engineering maintenance
- A false negative is acceptable if the true lift is very small
- The team wants at least 80% power at a 5% significance level
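As a quick feasibility check under these constraints, the sketch below runs a standard two-proportion sample-size calculation. The 12% baseline rate and 0.5 pp absolute MDE are illustrative assumptions only (the brief asks you to choose and justify your own); the 5% significance, 80% power, and traffic figures come from the constraints above.

```python
from math import sqrt
from statistics import NormalDist

# Assumed, illustrative inputs -- not specified by the brief:
BASELINE = 0.12            # assumed baseline rate of a binary primary metric
MDE = 0.005                # assumed absolute MDE (0.5 percentage points)
# From the brief:
ALPHA, POWER = 0.05, 0.80  # 5% significance (two-sided), 80% power

z_alpha = NormalDist().inv_cdf(1 - ALPHA / 2)
z_beta = NormalDist().inv_cdf(POWER)

# Two-proportion sample size per arm (normal approximation):
p_bar = BASELINE + MDE / 2
n_per_arm = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_beta * sqrt(BASELINE * (1 - BASELINE)
                              + (BASELINE + MDE) * (1 - BASELINE - MDE)))
             / MDE) ** 2

eligible_per_day = 240_000 * 0.80  # 192,000 eligible users/day (from the brief)
# Naive bound: treats each day's eligible users as new enrollees. In a DAU-based
# app the same users return daily, so unique enrollment grows sublinearly --
# but even on day one, 192,000 eligible users comfortably exceed 2 * n_per_arm.
days_needed = 2 * n_per_arm / eligible_per_day

print(f"n per arm ~ {n_per_arm:,.0f}, days to enroll at 100% allocation ~ {days_needed:.2f}")
```

Under these assumptions the enrollment requirement is far below daily eligible traffic, so the binding constraint on duration is the 21-day window needed to observe novelty decay, not statistical power.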
Task
- Define the null and alternative hypotheses, making clear how you will distinguish a novelty effect from a durable treatment effect.
- Choose a primary metric, 2-4 guardrails, and at least one secondary metric. State the baseline and an explicit MDE.
- Calculate the required sample size and show whether the experiment can be completed within 21 days given available traffic.
- Propose the experiment design: unit of randomization, allocation, duration, and any stratification. Explain how your design helps detect novelty effects rather than just average lift.
- Pre-register the analysis plan, including the statistical test, peeking policy, multiple-comparison policy, and how you will make the final ship / don’t-ship decision if week 1 is positive but week 3 is flat or negative.
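One way to make the week-1-vs-week-3 decision rule concrete is a difference-in-lifts test: estimate the treatment effect separately in week 1 and week 3 and test whether the effect has decayed. The counts below are made-up illustrations, not data from the brief, and treating the two weeks as independent samples overstates precision when the same users appear in both.

```python
from math import sqrt
from statistics import NormalDist

def lift_and_se(conv_t, n_t, conv_c, n_c):
    """Absolute lift (treatment - control) for a binary metric and its
    standard error, via the normal approximation."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return p_t - p_c, se

# Illustrative, made-up counts: a large week-1 lift that mostly vanishes by week 3.
lift_w1, se_w1 = lift_and_se(conv_t=9_100, n_t=67_500, conv_c=8_100, n_c=67_500)
lift_w3, se_w3 = lift_and_se(conv_t=8_200, n_t=67_500, conv_c=8_100, n_c=67_500)

# Difference-in-lifts z-test: is the week-1 effect larger than the week-3 effect?
# Independence across weeks is assumed here, which is optimistic for repeat users.
diff = lift_w1 - lift_w3
se_diff = sqrt(se_w1 ** 2 + se_w3 ** 2)
z = diff / se_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"week-1 lift {lift_w1:.4f}, week-3 lift {lift_w3:.4f}, decay p={p_value:.4f}")
```

A significant decay test with a flat week-3 effect is evidence of a novelty-only feature; a durable feature shows week-3 lift statistically indistinguishable from week 1.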
Be concrete: use real numbers, not placeholders. Your answer should explicitly address novelty effects as a first-class risk, not just mention them in passing.