Context
StreamCart, a grocery delivery app, is testing a new "Smart Reorder" widget on the home screen that suggests frequently purchased items. Product leadership wants to know not only whether the widget improves reorder conversion, but also what guardrails must be in place before a broader rollout.
Hypothesis Seed
The widget is expected to reduce friction for repeat buyers and increase the weekly reorder rate. However, it could also cause unintended harm: slowing app performance, cannibalizing discovery of higher-margin items, or driving up support contacts if recommendations look wrong.
Constraints
- Eligible traffic: 240,000 returning users per day
- Maximum experiment duration: 14 days
- Rollout decision required at the end of the test window
- False positives are costly because a bad rollout could hurt revenue and trust at scale
- False negatives are acceptable up to a point; the team would rather miss a small win than ship a harmful experience
Task
- Define the experiment hypothesis, the primary success metric, and 2-4 guardrail metrics you would pre-register for this rollout.
- Choose an explicit MDE, calculate the required sample size, and determine whether the test can be completed within 14 days.
- Specify the unit of randomization, allocation plan, duration, and any stratification or blocking you would use.
- Write the analysis plan: statistical test, peeking policy, multiple-comparison treatment, and how you will validate experiment integrity.
- State a clear ship / don't-ship / iterate decision rule that respects both the primary metric and the guardrails.
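For the experiment-integrity part of the analysis plan, a common concrete check is a sample ratio mismatch (SRM) test: a chi-square goodness-of-fit test on assignment counts against the planned split. A minimal stdlib-only sketch, with illustrative counts and an illustrative alert threshold (both are assumptions, not data from this brief):

```python
import math

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square (df=1) goodness-of-fit p-value for a sample ratio mismatch check."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total - exp_c
    stat = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For df=1, the chi-square survival function reduces to erfc(sqrt(x/2))
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical assignment counts under a planned 50/50 split
p = srm_pvalue(120_480, 119_520)
srm_detected = p < 0.001  # SRM checks typically use a strict threshold
```

Because the SRM check is run routinely (often on every look at the data), a strict threshold such as p < 0.001 keeps the false-alarm rate low; a triggered SRM alert usually means the assignment or logging pipeline is broken and the results should not be trusted.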
Assume the baseline weekly reorder conversion among eligible returning users is 18.0%. Historical data supports treating the outcome as binary per user: each eligible user either places at least one reorder during the 7-day observation window or does not. You may assume a two-sided test with 5% significance and 80% power unless you justify a different choice.
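As a sketch of the sample-size and feasibility check, assuming a hypothetical MDE of +1.0 percentage point absolute on the stated 18.0% baseline (the MDE is an illustrative choice, not given in the brief), the standard normal-approximation formula for a two-sided two-proportion z-test gives:

```python
import math
from statistics import NormalDist

def two_prop_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ≈ 0.84 for 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p2 - p1) ** 2)

# Hypothetical MDE of +1.0 pp absolute: 18.0% -> 19.0%
n_per_arm = two_prop_sample_size(0.18, 0.19)
```

Under these assumptions the requirement is roughly 23.7k users per arm (about 47k total). With 240,000 eligible returning users per day, that fits easily even though the 7-day observation window leaves only about 7 days of the 14-day budget for enrollment; the binding constraint is more likely the guardrail metrics, which may need larger samples to detect small harms.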