Context
AdSphere runs sponsored listings in a large marketplace app. The ads team has trained a new ad-ranking model that is expected to improve ad quality and monetization, and wants a rigorous online experiment before launch.
Hypothesis Seed
The new model uses richer user-context and conversion features, and the team believes it will increase revenue per 1,000 ad impressions by showing more relevant ads. However, leadership is concerned that a model optimized too aggressively for revenue could hurt user engagement, advertiser fairness, or system latency.
Constraints
- Eligible traffic: 12 million ad impressions/day across homepage, search, and feed surfaces
- Average baseline click-through rate (CTR): 1.8%
- Average baseline revenue per 1,000 impressions (RPM): $24.00, with an impression-level standard deviation of approximately $120 on the same per-1,000-impression scale
- Maximum experiment duration: 14 days
- The unit of randomization must be chosen so that users do not see inconsistent ad ordering within a session
- False positives are expensive because a bad launch can reduce marketplace trust and advertiser ROI; false negatives are tolerable but should still be minimized
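As a rough illustration of how the constraints above feed a power calculation, the following sketch applies the standard two-sample normal approximation to RPM. The 1% relative MDE, the significance level of 0.01, the 80% power, and the `design_effect` parameter are all assumptions made for illustration; none of them is specified in the brief, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def impressions_per_arm(sd, mde_abs, alpha=0.01, power=0.80, design_effect=1.0):
    """Impressions per arm for a two-sided, two-sample z-test on a mean.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / mde_abs^2, optionally
    inflated by a design effect to account for clustering (for example,
    user-level randomization with impression-level analysis).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / mde_abs ** 2
    return ceil(n * design_effect)


# Values taken from the constraints above.
BASELINE_RPM = 24.00
RPM_SD = 120.0
DAILY_IMPRESSIONS = 12_000_000

# Assumption: a 1% relative lift in RPM is the smallest effect worth detecting.
REL_MDE = 0.01

n_arm = impressions_per_arm(RPM_SD, BASELINE_RPM * REL_MDE)
days_needed = 2 * n_arm / DAILY_IMPRESSIONS  # 50/50 split, all eligible traffic

print(f"impressions per arm: {n_arm:,}")
print(f"days at full traffic: {days_needed:.2f}")
```

Under these assumed inputs the requirement comes out on the order of 6 million impressions per arm, which fits well inside the 14-day budget on an impression-level basis. A design effect greater than 1 for user-level clustering would raise that figure, and a longer run may still be preferable to cover weekly cycles and novelty effects.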
Deliverables
- Define the experiment hypothesis, including the primary metric, 2-4 guardrails, and a clear minimum detectable effect (MDE).
- Compute the required sample size and estimate whether the test can be completed within the 14-day traffic budget.
- Choose the unit of randomization, allocation, and duration; explain trade-offs such as user-level consistency, interference, and variance.
- Pre-register an analysis plan: statistical test, treatment of multiple metrics, peeking policy, and how you will check for sample ratio mismatch.
- State a clear ship / don’t ship / iterate rule that respects both the primary metric and guardrails.
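The sample ratio mismatch check mentioned in the deliverables can be sketched as a chi-square goodness-of-fit test on the number of randomized units in each arm. This is a minimal sketch using only the standard library; the function name, the 50/50 expected split, and the 0.001 alert threshold are assumptions rather than requirements from the brief.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value of a chi-square goodness-of-fit test with 1 degree of freedom.

    Counts should be in the unit of randomization (for example, users),
    not impressions, so that observations are independent.
    """
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Assumption: flag SRM when p < 0.001, a commonly used conservative threshold.
SRM_ALPHA = 0.001

p = srm_p_value(100_000, 101_500)
print(f"SRM p-value: {p:.5f}, mismatch flagged: {p < SRM_ALPHA}")
```

A flagged mismatch usually points to a problem in assignment or logging, so the pre-registered plan can treat it as a reason to pause interpretation of the metrics rather than as a result in itself.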
Assume you may use asymptotic approximations, but your design should be robust enough for production decision-making. Be explicit about how you would handle pitfalls such as novelty effects, network interference from advertisers adapting bids, and any mismatch between unit of randomization and unit of analysis.
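One common way to handle a mismatch between the unit of randomization and the unit of analysis is the delta method: if users are randomized but RPM is an impression-level ratio, the standard error is computed from per-user totals instead of treating impressions as independent. The sketch below assumes per-user revenue and impression counts are available; the function name and the toy inputs are hypothetical.

```python
from math import sqrt
from statistics import mean


def ratio_and_delta_se(revenue_by_user, impressions_by_user):
    """Ratio metric sum(revenue) / sum(impressions) with a delta-method SE.

    Var(R) ~= (Var(Y) - 2 R Cov(Y, X) + R^2 Var(X)) / (n * mean(X)^2),
    where Y and X are per-user revenue and impression totals and n is the
    number of users (the randomization unit).
    """
    n = len(revenue_by_user)
    mean_y = mean(revenue_by_user)
    mean_x = mean(impressions_by_user)
    ratio = mean_y / mean_x

    pairs = list(zip(revenue_by_user, impressions_by_user))
    var_y = sum((y - mean_y) ** 2 for y, _ in pairs) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for _, x in pairs) / (n - 1)
    cov_xy = sum((y - mean_y) * (x - mean_x) for y, x in pairs) / (n - 1)

    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
    # Guard against tiny negative values caused by floating-point rounding.
    return ratio, sqrt(max(var_ratio, 0.0))


# Toy data: three users with equal impression counts and different revenue.
ratio, se = ratio_and_delta_se([2.0, 4.0, 6.0], [2, 2, 2])
print(f"revenue per impression: {ratio:.3f}, delta-method SE: {se:.3f}")
```

Multiplying the ratio and its standard error by 1,000 puts both on the RPM scale. The same per-arm estimates can then feed a two-sample z-test, with the caveat that the approximation relies on a large number of users per arm.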