Validate Reels Ranking Watch-Time Lift

Context

The Instagram Reels ranking team has launched an A/B test in Meta’s experimentation platform for a new Reels ranker. Early reads show +3% watch time in treatment, and leadership wants to know whether this is a real product win or just noise.

Hypothesis Seed

The new ranker reorders IG Reels using updated engagement features, including stronger weighting on predicted long-watch and IG Save propensity. The team believes this should improve the AARRR engagement layer for existing users by increasing Reels watch time without hurting downstream quality signals such as session exits, negative feedback, or creator ecosystem health.

Constraints

Eligible traffic: 12M daily active Reels viewers globally
Max experiment duration: 14 days; a ship/no-ship decision is required by then
Initial ramp: 5% for 1 day, then 50/50 split if no severe issues
False positives are costly because a bad ranker can degrade user experience at Meta scale; false negatives are also costly because ranking launches are expensive and delayed launches slow Reels growth
You may use CUPED with 7-day pre-experiment watch-time data if you justify it
Assume baseline daily per-user Reels watch time is 24.0 minutes, with a standard deviation of 60 minutes due to heavy-tailed usage
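
To make the figures above concrete, here is a minimal sketch of how they might feed a per-arm sample-size calculation for a two-sided, two-sample z-test on mean watch time. The 3% relative MDE, alpha of 0.05, power of 0.80, and the CUPED correlation `rho = 0.5` are illustrative assumptions, not values given in the prompt, and `required_n_per_arm` is a hypothetical helper name. It also treats each user as one observation with the stated standard deviation, which is a simplification.

```python
import math
from statistics import NormalDist


def required_n_per_arm(baseline, sd, rel_mde, alpha=0.05, power=0.80, rho=0.0):
    """Per-arm sample size for a two-sided, two-sample z-test on a mean.

    rho is the ASSUMED correlation between the pre-experiment covariate and
    the in-experiment metric; CUPED scales the variance by (1 - rho**2).
    """
    delta = baseline * rel_mde                 # absolute MDE in minutes
    variance = sd ** 2 * (1.0 - rho ** 2)      # CUPED-adjusted variance
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Baseline and SD come from the constraints; the MDE and rho are assumptions.
n_plain = required_n_per_arm(baseline=24.0, sd=60.0, rel_mde=0.03)
n_cuped = required_n_per_arm(baseline=24.0, sd=60.0, rel_mde=0.03, rho=0.5)
users_per_arm_per_day = 12_000_000 // 2       # 50/50 split of eligible traffic

print(n_plain, n_cuped, users_per_arm_per_day)
```

Under these assumptions the per-arm requirement is far below the 6M users available per arm per day at a 50/50 split, which suggests that runtime would be driven more by weekly seasonality and novelty effects than by raw statistical power.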

Task

State the null and alternative hypotheses for whether the observed +3% watch-time lift is real, and define the primary metric, guardrails, and at least one secondary metric using Meta vocabulary.
Choose the unit of randomization and unit of analysis, explain whether they should match, and discuss risks such as network interference, SUTVA violations, and novelty/primacy effects on IG Reels.
Compute the required sample size for a pre-registered MDE, translate it into runtime using the traffic above, and explain whether a 14-day test is sufficient. If you use CUPED, state how it changes variance assumptions.
Write the analysis plan: test statistic, confidence interval, peeking policy, multiple-comparison policy for guardrails/secondaries, and how you will check for Sample Ratio Mismatch (SRM).
Give a clear ship / don’t ship / iterate decision rule for the case where watch time is up 3% but one or more guardrails move in the wrong direction.
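
The Sample Ratio Mismatch check named in the analysis-plan task could be sketched as a chi-square goodness-of-fit test on assignment counts. The helper name `srm_p_value`, the example counts, and the 0.001 alert threshold are assumptions for illustration, not values from the prompt or from any specific experimentation platform.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for SRM."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


SRM_ALPHA = 0.001  # assumed alert threshold; a strict cutoff limits false alarms

# Hypothetical assignment counts for a nominal 50/50 split.
p = srm_p_value(n_control=6_000_000, n_treatment=5_950_000)
print("SRM suspected" if p < SRM_ALPHA else "split looks consistent with 50/50")
```

A flagged mismatch would usually mean the assignment or logging pipeline needs investigation before any metric read is trusted.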
