
You work on a short-form video product and are testing a new Instagram Reels ranking treatment that surfaces more “save-worthy” content earlier in the session. The team believes this will improve the AARRR activation-to-retention path by increasing the Instagram Save rate, but there is concern that heavier or more niche content could reduce downstream sharing and virality. Early readouts from internal dashboards suggest the key metric is up while at least one guardrail is down.
How would you design and analyze this experiment so you can make a defensible ship decision if Instagram Save improves but a guardrail metric worsens? Explain how you would set the hypothesis, choose randomization, size the test, pre-register the analysis, and handle risks like novelty effect, SRM, and interference from social sharing.