Context
Streamly, a video streaming app, is testing a redesigned autoplay-next panel on the episode end screen. Product leadership wants to know how to act if the change improves the main conversion metric but harms a safety metric.
Hypothesis Seed
The new panel makes the next episode more prominent and is expected to increase the share of episode completions that lead to another episode start. However, the team worries it may also annoy users, increasing early-session exits and making the app feel pushy.
Constraints
- Eligible traffic: 240,000 episode-end impressions per day (impressions, not users; see the feasibility sketch after this list)
- Randomization must be at the user_id level to avoid a user seeing both designs in one binge session
- Maximum experiment duration: 14 days
- Allocation target: 50/50 after a 1-day 5% ramp for instrumentation checks
- False positives are costly because a bad rollout could hurt long-term retention; false negatives are acceptable if the effect is small
- The team will only ship if the gain is practically meaningful and no pre-registered guardrail is materially harmed
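Note that the traffic constraint is stated in impressions while randomization is by user, so feasibility depends on how many impressions each user generates. A back-of-envelope sketch, assuming a hypothetical rate of about 2 episode-end impressions per active user per day (the brief does not give this figure):

```python
# Convert impression traffic to user traffic for feasibility planning.
# ASSUMPTION: ~2 episode-end impressions per active user per day; replace
# with the observed rate from the instrumentation ramp.
daily_impressions = 240_000
impressions_per_user_per_day = 2.0

daily_users = daily_impressions / impressions_per_user_per_day
print(f"~{daily_users:,.0f} eligible users per day")  # ~120,000

# Unique users over 14 days is well below 14 * daily_users, since
# binge-watchers return on multiple days.
```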
Task
- Define the hypothesis, primary metric, guardrails, and a clear minimum detectable effect (MDE).
- Calculate the required sample size and determine whether the test can be completed within the 14-day limit using the available traffic (see the power-analysis sketch after this list).
- Choose the experiment design: unit of randomization, allocation, duration, and any stratification or blocking.
- Pre-register the analysis plan, including the statistical test, peeking policy, handling of multiple metrics, and how to handle user-level randomization with event-level outcomes (see the delta-method sketch after this list).
- State an explicit ship / don’t-ship / iterate rule for the case where the primary metric improves but a guardrail metric gets worse. Be specific about how large a guardrail decline is tolerable, if any, and how you would interpret statistical vs. practical significance (see the decision-rule sketch after this list).
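A minimal power-analysis sketch (Python, statsmodels) for the sample-size step. The baseline next-start rate and the MDE are not given above, so the 55% baseline and +2pp lift below are placeholders to be replaced with real numbers at pre-registration:

```python
# Sample size for a two-proportion z-test on the primary metric.
# BASELINE and MDE_ABS are ASSUMPTIONS, not figures from the brief.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

BASELINE = 0.55  # hypothetical completion -> next-episode start rate
MDE_ABS = 0.02   # hypothetical practically meaningful lift (+2pp absolute)
ALPHA = 0.05     # two-sided; the brief says false positives are costly
POWER = 0.80     # false negatives on small effects are acceptable

h = proportion_effectsize(BASELINE + MDE_ABS, BASELINE)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=h, alpha=ALPHA, power=POWER, ratio=1.0,
    alternative="two-sided",
)
print(f"~{n_per_arm:,.0f} users per arm")  # ~9,700 at these inputs

# Caveat: this treats one Bernoulli outcome per user (e.g., the first
# eligible impression). If all impressions per user are analyzed, inflate
# n by the design effect 1 + (m - 1) * rho, where m is impressions per
# user and rho is the intra-user correlation.
```

Under the earlier traffic assumption (~120,000 eligible users per day), a requirement on this order is reachable within the first day or two at full allocation, so the binding constraint on duration is covering at least one full weekly cycle, not power.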
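Because randomization is at the user_id level while the primary metric is a ratio of event-level counts (next-episode starts over episode completions), one standard approach is to aggregate to per-user sums and apply the delta method to the ratio of means. A sketch with hypothetical column names:

```python
# Delta-method variance for a ratio metric under user-level randomization.
# Column names (user_id, next_starts, completions) are hypothetical.
import numpy as np
import pandas as pd

def ratio_metric(df: pd.DataFrame, num_col: str, den_col: str):
    """Return (ratio, variance) for sum(num)/sum(den), clustered by user."""
    per_user = df.groupby("user_id")[[num_col, den_col]].sum()
    x = per_user[num_col].to_numpy(dtype=float)  # numerator per user
    y = per_user[den_col].to_numpy(dtype=float)  # denominator per user
    n = len(per_user)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cxy = np.cov(x, y, ddof=1)[0, 1]
    r = mx / my
    # Delta method: Var(Xbar/Ybar) ~= (vx - 2*r*cxy + r**2 * vy) / (n * my**2)
    var = (vx - 2 * r * cxy + r**2 * vy) / (n * my**2)
    return r, var

# A two-sample z statistic then compares arms:
# z = (r_t - r_c) / sqrt(var_t + var_c)
```

The same machinery covers the guardrails if they are also event-level ratios; a cluster-robust regression would be an equivalent alternative.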
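Finally, the ship / don’t-ship / iterate rule can be pre-registered unambiguously by encoding it. The thresholds below (a +2pp practical-significance bar for the primary, a -0.5pp non-inferiority margin for each guardrail) are illustrative placeholders, not numbers from the brief:

```python
# Hypothetical encoding of a pre-registered decision rule. All thresholds
# are placeholders; real margins belong in the pre-registration document.
from dataclasses import dataclass

@dataclass
class MetricResult:
    estimate: float  # treatment minus control, absolute difference
    ci_low: float    # lower bound of the multiplicity-adjusted CI
    ci_high: float   # upper bound

def decide(primary: MetricResult, guardrails: list[MetricResult]) -> str:
    MDE = 0.02                 # practical-significance bar for the primary
    GUARDRAIL_MARGIN = -0.005  # worst tolerable decline (higher-is-better)

    primary_wins = primary.ci_low > 0 and primary.estimate >= MDE
    # A guardrail "holds" only if a decline beyond the margin is ruled out.
    guardrails_hold = all(g.ci_low > GUARDRAIL_MARGIN for g in guardrails)
    # Material harm: the whole CI sits below the tolerable margin.
    material_harm = any(g.ci_high < GUARDRAIL_MARGIN for g in guardrails)

    if primary_wins and guardrails_hold:
        return "ship"
    if material_harm:
        return "don't ship"
    return "iterate"  # sub-MDE win or inconclusive guardrails
```

Under this framing, a statistically significant guardrail decline smaller than the margin would not block a ship, while an inconclusive guardrail (CI straddling the margin) sends the team back to iterate rather than ship.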