Context
StreamCart, a grocery delivery app, is testing a redesigned checkout page intended to reduce friction and increase order completion. Three days into the experiment, the treatment shows a large conversion lift on mobile Safari, a drop on Android, and an unexpected 56/44 traffic split instead of 50/50.
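A 56/44 split where 50/50 was configured is the classic signature of a sample-ratio mismatch (SRM), and it can be checked with a chi-square goodness-of-fit test. A minimal sketch, assuming illustrative assignment counts (the brief gives only the ratio, not the totals):

```python
import math

def srm_pvalue(n_control: int, n_treatment: int) -> float:
    """Chi-square goodness-of-fit p-value against an expected 50/50 split (1 df)."""
    total = n_control + n_treatment
    expected = total / 2
    chi2 = ((n_control - expected) ** 2 / expected
            + (n_treatment - expected) ** 2 / expected)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts matching the observed 56/44 split.
p = srm_pvalue(56_000, 44_000)
```

At realistic traffic volumes the p-value is astronomically small, which means the split is almost certainly not chance: the assignment and logging pipeline should be audited before any metric from this experiment is trusted.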
Hypothesis Seed
The new checkout design shortens the form and highlights saved payment methods. The product team believes it will improve checkout completion without increasing payment failures or customer support contacts. Because the data already looks suspicious, the key challenge is not only designing the experiment, but deciding how to handle inconsistent or potentially invalid results.
Constraints
- Eligible traffic: 120,000 checkout starts per day
- Maximum runtime: 14 days; leadership needs a decision before a seasonal campaign
- Planned allocation: 50/50 after a 5% canary for instrumentation checks
- Baseline checkout completion rate: 38%
- False positives are costly because shipping a broken checkout directly impacts revenue; false negatives are acceptable if they prevent shipping a buggy experience
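The runtime feasibility implied by these constraints can be sketched with the standard two-proportion sample-size formula. The 1-point absolute MDE, α = 0.05, and 80% power below are illustrative assumptions, not values given in the brief:

```python
import math

def n_per_arm(p_base: float, mde_abs: float) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test
    at alpha = 0.05 and 80% power."""
    z_alpha, z_beta = 1.959964, 0.841621  # normal quantiles for alpha/2 and power
    p1, p2 = p_base, p_base + mde_abs
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / mde_abs ** 2)

n = n_per_arm(0.38, 0.01)   # baseline 38%, illustrative 1-point absolute MDE
days = 2 * n / 120_000      # 120k eligible checkout starts/day at a 50/50 split
```

At a 1-point MDE the test needs roughly 37k users per arm (under a day of traffic); even a 0.5-point MDE (~148k per arm, about 2.5 days) fits comfortably inside the 14-day cap, so duration is not the binding constraint here.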
Deliverables
- Define the hypothesis, primary metric, guardrails, and a realistic MDE for this test.
- Calculate the required sample size and determine whether the test can complete within the 14-day runtime.
- Choose the unit of randomization and explain how you would investigate suspicious data before trusting the result (for example SRM, logging bugs, or segment-specific anomalies).
- Pre-register the analysis plan: test choice, peeking policy, multiple-comparison handling, and what to do if data quality checks fail.
- Provide a clear ship / don’t-ship / investigate decision rule that respects guardrails and explicitly handles the case where the experiment is statistically significant but the data appears inconsistent.
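One way to make such a decision rule unambiguous is to encode the precedence of checks explicitly: data validity outranks guardrails, which outrank statistical significance. A sketch with hypothetical inputs and thresholds (none of these names are prescribed by the brief):

```python
def decide(srm_ok: bool, guardrails_ok: bool,
           significant: bool, lift_positive: bool) -> str:
    """Ship / don't-ship / investigate rule with explicit check precedence."""
    if not srm_ok:
        # A significant result on top of a sample-ratio mismatch is untrustworthy.
        return "investigate"
    if not guardrails_ok:
        # Payment failures or support contacts regressed: never ship on lift alone.
        return "don't ship"
    if significant and lift_positive:
        return "ship"
    return "don't ship"

# The scenario in the brief: a significant lift but a 56/44 split.
outcome = decide(srm_ok=False, guardrails_ok=True,
                 significant=True, lift_positive=True)
```

Here the experiment is statistically significant yet the function returns "investigate", which is exactly the case the last deliverable asks the analysis plan to handle.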