Context
Streamly, a short-video app, is testing a redesigned home feed that surfaces more personalized creator recommendations. The product team wants to know not just whether the variant wins, but whether the experiment itself can be trusted.
Hypothesis Seed
The new feed ranking is expected to increase the rate at which users watch at least one video for 30+ seconds after opening the app, because the recommendations should feel more relevant. However, the change could also hurt ad revenue or session stability, and the results could be misleading if the test is analyzed poorly.
Constraints
- Eligible traffic: 600,000 daily active users per day
- Maximum experiment duration: 14 days
- Allocation target: 50/50 after a 1-day 5% ramp for instrumentation checks
- Baseline 30-second watch-start conversion: 40%
- Smallest business-relevant lift: 2% relative (40% to 40.8%, i.e. +0.8 percentage points absolute)
- False positives are costly because a bad ranking launch can reduce creator exposure and ad revenue; false negatives are acceptable because the team can rerun the test later
Deliverables
- State the null and alternative hypotheses, define the primary metric, and choose 2-4 guardrail metrics with thresholds (a minimal formulation follows this list).
- Choose and justify alpha and power (the cost asymmetry above argues for a conservative alpha), compute the required sample size per arm from the baseline and MDE, and translate it into expected runtime under the traffic constraint (see the sample-size sketch below).
- Choose the unit of randomization and explain how you would analyze the test if the metric is measured at a different level than the randomization unit (see the clustered-analysis sketch below).
- Pre-register an analysis plan covering the statistical test, the peeking policy, the multiple-comparisons treatment, and a ship / don't-ship rule (see the alpha-spending sketch below).
- Identify the main pitfalls to watch for when analyzing this A/B test, covering at minimum peeking, novelty effects, network interference (SUTVA violations), and sample ratio mismatch, and explain how you would mitigate each (see the SRM check below).
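For the first deliverable, a minimal hypothesis formulation, writing p_C and p_V for the control and variant rates of the primary metric (share of users with at least one 30-second watch-start after opening the app); a two-sided alternative is the safer pre-registration default even though the seed is directional:

```latex
H_0 : p_V = p_C \qquad H_1 : p_V \neq p_C
```

Guardrails consistent with the hypothesis seed would be ad revenue per user, crash or session-abandon rate, and creator-exposure share, each with a pre-registered non-inferiority margin; the specific metrics and margins are suggestions, not part of the brief.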
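A minimal sample-size sketch for the second deliverable. The alpha and power values are assumptions, not given in the brief: a strict alpha = 0.01 reflects the stated asymmetry that false positives are costlier, and power = 0.80 is a standard choice given that false negatives are tolerable:

```python
from scipy.stats import norm

# Stated constraints
baseline = 0.40                       # control 30s watch-start rate
mde_rel = 0.02                        # smallest business-relevant relative lift
variant = baseline * (1 + mde_rel)    # 0.408, i.e. +0.8 percentage points

# Assumed error rates (not in the brief): strict alpha because false
# positives are costly; 80% power since the test can be rerun.
alpha, power = 0.01, 0.80

z_a = norm.ppf(1 - alpha / 2)         # two-sided critical value
z_b = norm.ppf(power)

# Two-proportion z-test sample size per arm (unpooled variances)
var_sum = baseline * (1 - baseline) + variant * (1 - variant)
n_per_arm = (z_a + z_b) ** 2 * var_sum / (variant - baseline) ** 2
print(f"n per arm ~ {n_per_arm:,.0f}")       # ~88,000 under these assumptions

# Runtime under the traffic constraint: 600k eligible DAU at 50/50.
daily_per_arm = 600_000 * 0.5
print(f"days to fill ~ {n_per_arm / daily_per_arm:.2f}")  # well under 1 day
```

Even at this strict alpha the required sample fills in under a day of traffic, so the binding constraint is not power but coverage: running one to two full weeks anyway averages over day-of-week and novelty effects.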
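For the randomization-unit deliverable: randomizing by user is the natural choice here, but the metric may be logged per session (per app open), and session rows within a user are correlated. A sketch of one standard remedy, clustering standard errors on the randomization unit; the data below is synthetic and the column names are illustrative, not from the brief:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Synthetic data: users are randomized, but each user contributes
# several sessions, so session-level rows are not independent.
n_users = 2_000
users = pd.DataFrame({
    "user_id": np.arange(n_users),
    "treated": rng.integers(0, 2, n_users),
    "propensity": rng.beta(4, 6, n_users),  # per-user base rate -> correlation
})
n_sessions = rng.integers(1, 6, n_users)    # 1-5 app opens per user
sessions = users.loc[users.index.repeat(n_sessions)].reset_index(drop=True)
sessions["watch30"] = rng.binomial(
    1, np.clip(sessions["propensity"] + 0.01 * sessions["treated"], 0, 1)
)

# Linear probability model with standard errors clustered on user_id,
# the unit of randomization: the point estimate stays an interpretable
# rate difference while inference respects within-user correlation.
fit = smf.ols("watch30 ~ treated", data=sessions).fit(
    cov_type="cluster", cov_kwds={"groups": sessions["user_id"]}
)
print(fit.summary().tables[1])
```

An often simpler alternative is to collapse to one row per user (any 30s+ watch-start, yes or no) before testing, which matches the primary metric definition directly and sidesteps the clustering problem.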
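For the pre-registration deliverable, the final test can be an ordinary two-proportion z-test; the interesting part of the plan is the peeking policy. One common choice is a group-sequential design with a Lan-DeMets O'Brien-Fleming-style alpha-spending schedule, which allocates almost none of the alpha budget to early looks. A sketch, with placeholder counts and the assumed alpha from the sample-size sketch:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

# Final analysis as pre-registered: two-sided two-proportion z-test on
# the primary metric. Counts are placeholders, not real results.
successes = np.array([35_500, 36_200])   # 30s watch-starts: control, variant
totals = np.array([88_000, 88_000])      # users per arm
z_stat, p_value = proportions_ztest(successes, totals)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Peeking policy: cumulative alpha allowed at information fraction t
# under the Lan-DeMets O'Brien-Fleming-style spending function
#   alpha*(t) = 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t)))
# Converting this schedule into interim critical values requires the
# joint distribution of the sequential z-statistics (group-sequential
# software); the point here is how little alpha early looks receive.
alpha = 0.01  # assumed, matching the sample-size sketch
for t in (0.25, 0.50, 0.75, 1.00):
    spend = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))
    print(f"information {t:.0%}: cumulative alpha spent ~ {spend:.6f}")
```

Since there is a single primary metric, multiple-comparisons treatment mainly concerns the guardrails and any segment analyses, which can be pre-registered as non-inferiority checks or explicitly labeled exploratory.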
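For the pitfalls deliverable, sample ratio mismatch is the easiest one to automate: under a healthy 50/50 allocation the treatment count is binomial, so a goodness-of-fit test flags broken assignment or logging. A sketch with placeholder counts:

```python
from scipy.stats import binomtest

# Observed arm sizes (placeholders). Under a healthy 50/50 split the
# treatment count is Binomial(n_total, p=0.5).
n_control, n_treatment = 301_000, 299_000
result = binomtest(n_treatment, n_control + n_treatment, p=0.5)

# A very strict threshold (e.g. p < 0.001) is conventional, because the
# check runs repeatedly and a genuine SRM usually yields a tiny p-value.
print(f"SRM p-value: {result.pvalue:.4g}")
if result.pvalue < 0.001:
    print("SRM detected: halt the analysis and audit assignment and logging.")
```

The other pitfalls are procedural rather than computational: peeking is handled by the alpha-spending plan above, novelty effects by running at least one full week and comparing early versus late treatment effects, and interference by checking whether creator-side exposure couples the arms (a shared creator pool can violate SUTVA even when users are randomized independently).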