Context
InboxPro, an email productivity app, wants to launch Smart Compose, a feature that suggests the rest of a sentence while a user writes an email. Product leadership wants a rigorous online experiment before rollout.
Hypothesis Seed
The team believes Smart Compose will reduce email drafting friction, leading to more sent emails per active user and faster compose completion. However, there is risk that low-quality suggestions could frustrate users, increase undo/delete behavior, or reduce reply quality.
Constraints
- Eligible traffic: 120,000 daily active compose users
- Only 60% of DAU are eligible because the feature is available only in English on mobile and desktop
- Maximum experiment duration: 14 days
- Randomization must be stable for the same user across sessions
- False positives are costly because poor suggestions could damage trust and retention; false negatives are acceptable if the feature can be iterated and re-tested next month
- Engineering asks for a clear ship/no-ship decision by the end of the 14-day window
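The stable-randomization constraint is commonly met by hashing the user ID with an experiment-specific salt, so the same user lands in the same arm in every session and on every device. The following is a minimal sketch under assumptions: the function name `assign_arm`, the salt `smart_compose_v1`, and the 50/50 split are illustrative and do not come from the case.

```python
import hashlib


def assign_arm(user_id: str,
               experiment_salt: str = "smart_compose_v1",  # hypothetical salt
               treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    The same (salt, user_id) pair always yields the same arm, so the
    assignment is stable across sessions without storing any state.
    """
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a roughly uniform number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_arm(uid))
```

Using a salt that is unique to the experiment keeps this assignment independent of assignments in other experiments that hash the same user IDs.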
Deliverables
- Define the experiment hypothesis, the primary metric, and 2-4 guardrail metrics. Be explicit about the metric formula, unit of analysis, baseline, and minimum detectable effect (MDE).
- Calculate the required sample size per arm using explicit assumptions for alpha, power, baseline rate/variance, and MDE. Translate that into expected runtime given the available traffic.
- Choose the unit of randomization and justify it. Explain whether the analysis unit should match the randomization unit and how you would handle any mismatch.
- Pre-register the analysis plan: statistical test, peeking policy, treatment of secondary metrics, and how you will check for sample ratio mismatch.
- State a professional decision rule for ship / don’t ship / iterate that respects guardrails, and discuss key pitfalls such as novelty effects, interference across users, and SUTVA violations.
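One way to run the sample ratio mismatch check named in the deliverables is a chi-square goodness-of-fit test on the user counts per arm. The sketch below assumes a planned 50/50 split and an alert threshold of p < 0.001; both are illustrative choices, as are the function name `srm_p_value` and the counts in the usage example.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test with 1 degree of freedom.

    Compares the observed arm counts against the planned split.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    SRM_ALERT_THRESHOLD = 0.001  # assumed threshold, not from the case
    p = srm_p_value(n_control=50_000, n_treatment=51_500)  # made-up counts
    print("SRM suspected" if p < SRM_ALERT_THRESHOLD else "No SRM detected", p)
```

The check should count randomization units (users), not sessions, because sessions per user can itself be affected by the treatment.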
Assume the current baseline send-through rate per compose session is 0.38 and that Smart Compose, if it works, improves it by roughly 2% in relative terms (to about 0.388). Also assume each eligible user starts 1.8 compose sessions per day on average.
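The sample-size deliverable can be sketched with the standard two-proportion formula, n per arm = (z_{1-alpha/2} + z_{power})^2 * [p1(1 - p1) + p2(1 - p2)] / (p2 - p1)^2. Everything beyond the stated baseline (0.38), relative lift (2%), and sessions per user (1.8) is an assumption here: a two-sided alpha of 0.01 (chosen because false positives are described as costly), power of 0.80, and a `design_effect` multiplier for sessions clustered within users, whose value would need to be estimated from historical data. Because the constraints can be read as either 120,000 eligible users per day or 60% of that figure, the usage example computes runtime under both readings.

```python
import math
from statistics import NormalDist


def required_sessions_per_arm(baseline: float, relative_lift: float,
                              alpha: float = 0.01, power: float = 0.80,
                              design_effect: float = 1.0) -> int:
    """Sessions needed per arm for a two-sided two-proportion z-test.

    design_effect > 1 inflates the count to account for correlated
    sessions from the same user (assumed value, estimate from data).
    """
    p1 = baseline
    p2 = baseline * (1.0 + relative_lift)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / delta ** 2
    return math.ceil(n * design_effect)


def runtime_days(sessions_per_arm: int, daily_users: float,
                 sessions_per_user_per_day: float, arms: int = 2) -> int:
    """Whole days needed to collect the required sessions across all arms."""
    daily_sessions = daily_users * sessions_per_user_per_day
    return math.ceil(arms * sessions_per_arm / daily_sessions)


if __name__ == "__main__":
    n = required_sessions_per_arm(baseline=0.38, relative_lift=0.02)
    print("sessions per arm:", n)
    for label, users in (("120,000 eligible users", 120_000),
                         ("60% of 120,000 users", 0.6 * 120_000)):
        print(label, "->", runtime_days(n, users, 1.8), "day(s)")
```

Even when the raw traffic would reach the required count quickly, running for at least one or two full weeks is a common choice so that day-of-week patterns and novelty effects have time to show up.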