Context
ShopNow, a mid-sized e-commerce app, tested a new checkout feature that pre-fills shipping details for logged-in users. The experiment result is statistically significant and positive, but the observed lift is small, and leadership wants a rigorous framework for deciding whether the feature is worth launching.
Hypothesis Seed
The new pre-fill experience is expected to reduce checkout friction and slightly improve purchase conversion. However, it may also increase address errors, customer support contacts, or engineering complexity, so a statistically significant result alone is not enough to justify launch.
Constraints
- Eligible traffic: 120,000 checkout-starting users per day
- Only 70% of users are logged in and eligible for the feature
- Maximum experiment window: 14 days, after which the team must decide ship / iterate / drop
- Engineering estimates the feature is worth launching only if purchase conversion improves by at least 1.0% relative
- False positives are costly because rollback requires app review and customer support retraining
- False negatives are also costly because checkout is a major revenue funnel
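The constraints above imply a hard ceiling on achievable sample size. A quick arithmetic sketch, assuming every eligible logged-in user can be enrolled and a 50/50 split (the split is an assumption, not stated in the brief):

```python
# Traffic ceiling implied by the constraints
daily_checkout_starts = 120_000
logged_in_share = 0.70
window_days = 14

daily_eligible = int(daily_checkout_starts * logged_in_share)  # 84,000 eligible users/day
total_eligible = daily_eligible * window_days                  # 1,176,000 over 14 days
per_arm_cap = total_eligible // 2                              # 588,000 per arm at 50/50

print(f"{daily_eligible:,} eligible/day, {total_eligible:,} in window, {per_arm_cap:,} per arm")
```

Any sample-size requirement from the power calculation has to fit inside this ~1.18M-user budget.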
Deliverables
- Define the null and alternative hypotheses, the primary metric, 2-4 guardrail metrics, and a clear minimum detectable effect (MDE) that reflects business value rather than just statistical detectability.
- Calculate the required sample size and determine whether the available traffic can support the test within 14 days. Show the math explicitly and state any assumptions.
- Choose the unit of randomization, allocation, duration, and any stratification. Explain how your design handles repeat visitors and cross-device behavior.
- Pre-register an analysis plan: statistical test, peeking policy, treatment of secondary metrics, SRM checks, and how you will interpret a result that is statistically significant but below the practical threshold.
- State a final decision rule for ship / don’t ship / iterate that respects both the primary metric and guardrails.
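As a starting point for the sample-size and SRM deliverables, here is a minimal sketch. The baseline conversion rate (5%), two-sided alpha of 0.05, 80% power, and 50/50 allocation are all assumptions for illustration; none are given in the brief, and the submitted analysis should justify its own values:

```python
import math

# --- Assumptions (not stated in the brief) ---
BASELINE = 0.05          # assumed baseline purchase conversion
ALPHA_Z = 1.96           # z for two-sided alpha = 0.05
POWER_Z = 0.8416         # z for power = 0.80
RELATIVE_MDE = 0.01      # 1.0% relative lift, from the constraints

def sample_size_per_arm(p: float, rel_mde: float) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    delta = p * rel_mde  # absolute lift to detect
    return math.ceil(2 * (ALPHA_Z + POWER_Z) ** 2 * p * (1 - p) / delta ** 2)

def srm_detected(n_control: int, n_treatment: int) -> bool:
    """Sample-ratio-mismatch check for an intended 50/50 split.

    Returns True if the observed split deviates from 50/50 beyond the
    chi-square critical value 3.841 (df=1, alpha=0.05).
    """
    expected = (n_control + n_treatment) / 2
    stat = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    return stat > 3.841

# Feasibility check against the 14-day traffic budget
eligible = int(120_000 * 0.70 * 14)                      # 1,176,000 eligible users
needed = 2 * sample_size_per_arm(BASELINE, RELATIVE_MDE)  # both arms combined
print(f"eligible in window: {eligible:,}")
print(f"needed (both arms): {needed:,}")
print("feasible" if eligible >= needed else "underpowered at this MDE")
```

Under these assumed parameters the required sample far exceeds the available traffic, which is exactly the kind of tension the deliverables ask you to surface: either the MDE, the baseline assumption, the power target, or the eligibility constraint has to give.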