Context
ShopLoop, a mobile commerce app, is testing a redesigned checkout page that simplifies shipping selection and highlights saved payment methods. The product manager asks a broader question than usual: what would make you trust or distrust the experiment result enough to ship?
Hypothesis Seed
The team believes the new checkout will reduce friction and increase completed purchases. Early design reviews suggest a likely lift in checkout completion, but there is concern that any apparent gain could be driven by instrumentation bugs, novelty effects, or harmful trade-offs such as higher refund rates.
Constraints
- Eligible traffic: 180,000 checkout starts per day
- 85% of traffic is on mobile app, 15% on mobile web
- Maximum experiment duration: 14 days
- Allocation can ramp, but the final design should support a ship decision within the 14-day window
- Baseline checkout completion rate: 48%
- Business wants to detect at least a 2.0% relative lift in checkout completion
- False positives are costly because a bad checkout experience directly hurts revenue; false negatives are acceptable if modest
- The team will monitor operational issues daily, but wants to avoid invalidating inference through peeking
Deliverables
- Define a clear null and alternative hypothesis, the primary metric, 2-4 guardrail metrics, and at least one secondary metric. Explicitly state what evidence would make you trust vs distrust the result.
- Calculate the required sample size using the stated baseline, alpha, power, and MDE. Translate that into expected runtime given available traffic.
- Choose the unit of randomization, allocation plan, duration, and any stratification. Explain why your design avoids contamination and supports valid inference.
- Pre-register an analysis plan: statistical test, peeking policy, handling of multiple comparisons, and checks for sample ratio mismatch or instrumentation issues.
- State a professional ship / don’t-ship / iterate rule that respects guardrails, and explain how you would interpret a statistically significant but practically small result.