Significant But Don’t Ship

Context

The Facebook Feed team is testing a new comment composer treatment that makes it easier to leave a quick reaction-plus-comment on posts. Early internal results suggest it may increase commenting, but leadership wants a rigorous experiment before launch.

Hypothesis Seed

The product hypothesis is that reducing friction in the composer will increase meaningful comment creation per Feed viewer. However, the team is worried that a statistically significant lift in comments could still be a bad launch if it comes with lower session quality, more spammy interactions, or only a trivial gain that is not worth added complexity.
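The gap between statistical and practical significance described above can be illustrated with a minimal sketch. It assumes a two-sided two-proportion z-test, treats viewer-days as independent (which overstates precision when the same viewer appears on many days), and uses made-up counts. The function name `lift_readout` and the 2% relative threshold passed to it are illustrative assumptions, not part of the prompt.

```python
from math import sqrt
from statistics import NormalDist


def lift_readout(x_c, n_c, x_t, n_t, min_relative_lift, alpha=0.05):
    """Compare statistical significance with a practical-lift threshold.

    x_c, x_t: viewer-days with a meaningful comment (control, treatment).
    n_c, n_t: total viewer-days per arm.
    min_relative_lift: smallest relative lift considered worth shipping.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference in proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = diff - z_crit * se, diff + z_crit * se
    return {
        "relative_lift": diff / p_c,
        "ci_absolute": (lower, upper),
        # Significant if the two-sided interval excludes zero.
        "significant": lower > 0 or upper < 0,
        # Practically meaningful only if even the lower bound clears the bar.
        "practical": lower >= min_relative_lift * p_c,
    }


# Hypothetical readout: 84M viewer-days per arm, 8.00% vs 8.04%.
result = lift_readout(6_720_000, 84_000_000, 6_753_600, 84_000_000, 0.02)
print(result)
```

With samples this large, a relative lift of about 0.5% is statistically significant yet falls well short of a 2% bar, which is the situation the title describes.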

Constraints

Eligible traffic: 12 million Facebook Feed viewers per day globally
Maximum experiment duration: 14 days, because the team must make a launch decision before the next quarterly planning cycle
Allocation target: 50/50 after a brief instrumentation ramp
False positives are costly because low-quality engagement can hurt long-term Feed experience
False negatives are also costly because comment creation is a strategic engagement goal
The team will monitor safety metrics daily, but the primary readout must follow a pre-registered plan

Deliverables

Define the null and alternative hypotheses, including whether the primary test should be one-sided or two-sided.
Choose a primary metric, 2-4 guardrail metrics, and 1-3 secondary metrics. State the MDE explicitly and explain why a statistically significant result may still not justify shipping.
Calculate the required sample size per arm using the provided baseline assumptions, and translate that into expected runtime given available traffic.
Specify the experiment design: unit of randomization, allocation, duration, stratification, and how you will handle peeking, multiple comparisons, and any mismatch between unit of randomization and analysis.
State a clear ship / don’t-ship / iterate rule that respects guardrails and addresses pitfalls such as novelty effects, sample ratio mismatch, and interference across users.
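One of the pitfalls named above, sample ratio mismatch, can be checked with a chi-square goodness-of-fit test on the per-arm counts. The sketch below is one possible check, assuming a 50/50 target split; the function name `srm_p_value` and the 0.001 alert threshold in the usage lines are assumptions (a commonly used convention, not something the prompt specifies).

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value of a 1-degree-of-freedom chi-square test for allocation skew."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts of randomized users per arm.
p = srm_p_value(6_000_000, 6_030_000)
if p < 0.001:  # assumed alert threshold
    print("Possible sample ratio mismatch; investigate before reading metrics.")
```

A very small p-value here suggests a problem in assignment or logging, in which case the metric readout should not be trusted until the cause is found.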

Use these planning assumptions for the primary metric: baseline meaningful comment creation rate = 8.0% per viewer-day, target MDE = +2.0% relative (8.0% to 8.16% absolute), alpha = 0.05, power = 80%.
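As a starting point for the sample size deliverable, the planning assumptions can be plugged into the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: a two-sided test, equal allocation, and independent viewer-day observations. The function name `sample_size_per_arm` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Observations per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


n = sample_size_per_arm(baseline=0.08, relative_mde=0.02)
daily_per_arm = 12_000_000 / 2  # 50/50 split of eligible daily viewers
print(n, n / daily_per_arm)
```

Under these assumptions the requirement comes to roughly 455,000 viewer-days per arm, a small fraction of one day of traffic at 6 million viewers per arm. If the same viewers are counted on multiple days, observations are correlated within viewer and the effective sample is smaller than the raw viewer-day count, so this figure is a lower bound on what a user-randomized analysis needs.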

