Context
ShopNow, a mid-sized e-commerce app, tested a redesigned mobile checkout intended to reduce friction and increase completed purchases. The first A/B test was declared inconclusive after stakeholders questioned both the setup and the interpretation.
Hypothesis Seed
The new checkout compresses the flow from 4 screens to 2 and auto-fills saved shipping information. Product believes this should improve purchase conversion among users who start checkout, but there is concern that the redesign could create confusion, increase payment failures, or produce a short-lived novelty bump.
Constraints
- Eligible traffic: 120,000 mobile users per day who start checkout
- Current baseline purchase conversion from checkout start: 38%
- Maximum experiment duration: 14 days; leadership needs a decision before a seasonal campaign
- Randomization must happen at the user level because users may have multiple sessions
- False positives are costly because a bad checkout harms revenue immediately; false negatives are acceptable but still undesirable because engineering has already invested 6 weeks
- You may assume 50/50 allocation after a small instrumentation ramp
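The traffic and duration limits above put a ceiling on how many users the test can enroll. A rough sketch of that arithmetic in Python follows. It assumes each day's 120,000 checkout starters are distinct users (returning users would lower the true count) and ignores the instrumentation ramp. The variable names are illustrative only.

```python
# Upper bound on enrollment implied by the constraints.
# Assumption: daily checkout starters are distinct users across days;
# repeat visitors would make the real number smaller.
DAILY_ELIGIBLE_USERS = 120_000
MAX_DAYS = 14
ALLOCATION_PER_ARM = 0.5  # 50/50 split, ramp period ignored

max_total_users = DAILY_ELIGIBLE_USERS * MAX_DAYS
max_users_per_arm = int(max_total_users * ALLOCATION_PER_ARM)

print(max_total_users)    # 1680000
print(max_users_per_arm)  # 840000
```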
Task
You are asked to redesign this experiment and explain what should have been learned from the failed first attempt.
- State the null and alternative hypotheses, then define the primary metric, 2-4 guardrail metrics, and an explicit minimum detectable effect (MDE) that would be worth shipping.
- Calculate the required sample size per arm using the given baseline and your chosen MDE, then translate that into expected runtime under the traffic constraint.
- Choose the unit of randomization, allocation plan, duration, and any stratification. Explain how you would avoid common failure modes such as peeking, novelty effects, and sample ratio mismatch.
- Pre-register the analysis plan: statistical test, handling of secondary metrics, multiple comparisons policy, and what you will do if the unit of analysis differs from the unit of randomization.
- Give a clear ship / don’t-ship / iterate rule for outcomes such as: significant lift with guardrail breach, non-significant result, or statistically significant but practically tiny improvement.
Be concrete. Show the math, not just the framework.
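As one illustration of the sample-size step, the sketch below applies the standard normal-approximation formula for comparing two proportions: n per arm = (z_(1-alpha/2) + z_(1-beta))^2 * [p1(1-p1) + p2(1-p2)] / (p2-p1)^2. The 1-percentage-point absolute MDE, two-sided alpha of 0.05, and 80% power are assumptions chosen for illustration, not values given in the brief; a stricter alpha may better fit the stated cost of false positives. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.80):
    """Users per arm for a two-sided two-proportion z-test (normal approximation)."""
    p_new = p_base + mde_abs  # conversion rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)


def runtime_days(n_per_arm, daily_users=120_000, share_per_arm=0.5):
    """Whole days needed to enroll n_per_arm users in each arm."""
    return math.ceil(n_per_arm / (daily_users * share_per_arm))


# Assumed inputs: 38% baseline from the brief, 1 pp absolute MDE (illustrative).
n = sample_size_per_arm(p_base=0.38, mde_abs=0.01)
print(n, runtime_days(n))
```

Under these assumed inputs the requirement comes to roughly 37,000 users per arm, which 60,000 users per arm per day would cover within a single day. In that case the run length would likely be set by weekly seasonality and novelty concerns rather than by statistical power. A smaller MDE or stricter alpha raises the requirement, so the inputs are worth varying before settling on a duration.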