Context
StreamCart, a grocery delivery app, recently tested a redesigned checkout page intended to reduce friction and increase completed orders. The first experiment finished without a statistically significant result, and leadership is asking whether the team should rerun it or move on.
Hypothesis Seed
The redesign shortens the checkout flow from 4 steps to 3 and surfaces saved payment methods earlier. Product believes this should improve checkout completion, but the prior test showed a small positive point estimate with a wide confidence interval, which left no clear decision.
Constraints
- Eligible traffic: 180,000 checkout starts per day
- Maximum additional runtime if rerun: 14 days
- Prior baseline checkout completion rate: 48%
- Rerunning has a small engineering cost, but a false positive is expensive: shipping a worse checkout experience can reduce revenue and trust
- The team wants a decision framework for when an inconclusive result justifies a rerun versus when it indicates the test was underpowered, poorly designed, or simply not impactful enough
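As a rough illustration of what these constraints imply, the sketch below computes the largest sample a 14-day rerun could collect and the smallest absolute lift such a test could reliably detect. The traffic, runtime, and baseline figures come from the list above; the two-sided significance level of 0.05, the 80% power target, the even two-arm split, and the treatment of each checkout start as an independent observation are assumptions made only for this example.

```python
from math import sqrt
from statistics import NormalDist

# Figures taken from the constraints above.
DAILY_CHECKOUT_STARTS = 180_000
MAX_RERUN_DAYS = 14
BASELINE_RATE = 0.48

# Assumptions for illustration only; the brief does not fix these.
ALPHA = 0.05   # two-sided significance level
POWER = 0.80   # probability of detecting a true effect of the given size
ARMS = 2       # control and treatment, split evenly

total_starts = DAILY_CHECKOUT_STARTS * MAX_RERUN_DAYS
starts_per_arm = total_starts // ARMS

# Standard normal quantiles for the chosen error rates.
z_alpha = NormalDist().inv_cdf(1 - ALPHA / 2)
z_power = NormalDist().inv_cdf(POWER)

# Normal approximation for a difference of two proportions,
# using the baseline variance in both arms.
se_diff = sqrt(2 * BASELINE_RATE * (1 - BASELINE_RATE) / starts_per_arm)
smallest_detectable_lift = (z_alpha + z_power) * se_diff

print(f"Total checkout starts in {MAX_RERUN_DAYS} days: {total_starts:,}")
print(f"Checkout starts per arm: {starts_per_arm:,}")
print(f"Smallest detectable absolute lift: {smallest_detectable_lift:.4%}")
```

Under these assumptions the traffic cap supports detecting an absolute lift of roughly 0.18 percentage points. If randomization is by user rather than by checkout start, repeat purchasers reduce the effective sample size, so this figure is optimistic.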
Task
- Define the null and alternative hypotheses, the primary metric, 2-4 guardrail metrics, and an explicit minimum detectable effect (MDE) that would justify shipping.
- Calculate the required sample size and duration for a rerun using the stated traffic, showing the math and explaining whether the prior inconclusive result should trigger a rerun.
- Propose the experiment design: unit of randomization, allocation, duration, and any stratification or variance-reduction choices.
- Pre-register the analysis plan, including the statistical test, peeking policy, multiple-comparisons policy, and how you would diagnose issues such as sample ratio mismatch.
- Give a clear ship / don’t ship / rerun decision rule. Explain what you would do if the rerun is statistically significant but the observed lift is smaller than the MDE, or if the primary metric improves while guardrails worsen.
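One possible starting point for the sample-size and sample-ratio-mismatch items in this list is sketched below. It is not a prescribed answer: the function names are hypothetical, the 0.5 percentage point MDE is a placeholder rather than a value from the brief, and the significance level, power, and 50/50 allocation are assumptions. The sample ratio mismatch check is a chi-square goodness-of-fit test with one degree of freedom, whose p-value can be obtained from the normal distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_starts_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test."""
    treated = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = baseline * (1 - baseline) + treated * (1 - treated)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)

def days_needed(per_arm, daily_starts, arms=2):
    """Whole days of traffic needed to fill every arm."""
    return ceil(per_arm * arms / daily_starts)

def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, chi2 is the square of a standard normal.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

# Placeholder MDE of 0.5 percentage points; baseline and traffic are from the brief.
per_arm = required_starts_per_arm(baseline=0.48, mde_abs=0.005)
duration = days_needed(per_arm, daily_starts=180_000)
print(f"Checkout starts per arm: {per_arm:,}")
print(f"Days of traffic needed: {duration}")

# A very small p-value suggests the observed split does not match the planned one.
print(f"SRM p-value for 91,500 vs 88,500: {srm_p_value(91_500, 88_500):.2e}")
```

Even when the arithmetic suggests only a few days of traffic are needed, a common practice is to run for at least one full week so that weekday and weekend shopping patterns are both represented.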