
You ran a 14-day user-level A/B test on an Instagram Reels ranking change. The treatment was intended to improve the AARRR activation-to-retention path by making Reels more engaging. In control, 520,000 users had a day-7 retention rate of 31.2% (162,240 retained) and averaged 1.84 Instagram Saves per user (variance 10.24). In treatment, 518,000 users had a day-7 retention rate of 30.4% (157,472 retained) and averaged 1.96 Saves per user (variance 10.89). Pre-experiment 7-day Saves per user were available, and a CUPED adjustment reduced the variance of the Saves metric by 28%. A sample ratio mismatch (SRM) check against a 50/50 split was not significant at the 5% level.
How would you analyze these conflicting results statistically and make a recommendation on whether to ship, hold back, or run follow-up experimentation?
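
One way to begin the statistical part is to put both metrics on a common footing with large-sample z-tests. The sketch below is illustrative only and rests on several assumptions not stated in the scenario: a pooled two-proportion z-test for retention, a two-sample z-test for mean Saves, and a CUPED adjustment that leaves the observed difference in means unchanged while scaling both arm variances by 0.72. The function names are hypothetical.

```python
import math

# Figures quoted in the scenario.
N_C, N_T = 520_000, 518_000              # users in control / treatment
RET_C, RET_T = 162_240, 157_472          # day-7 retained users
SAVES_MEAN_C, SAVES_VAR_C = 1.84, 10.24  # Saves per user, control
SAVES_MEAN_T, SAVES_VAR_T = 1.96, 10.89  # Saves per user, treatment
CUPED_VAR_REDUCTION = 0.28               # stated variance reduction


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def retention_test():
    """Pooled two-proportion z-test; returns (difference, z)."""
    p_c, p_t = RET_C / N_C, RET_T / N_T
    pooled = (RET_C + RET_T) / (N_C + N_T)
    se = math.sqrt(pooled * (1 - pooled) * (1 / N_C + 1 / N_T))
    diff = p_t - p_c
    return diff, diff / se


def saves_test(var_scale=1.0):
    """Two-sample z-test on mean Saves; returns (difference, z).

    var_scale < 1 mimics a CUPED-style variance reduction. Assumption:
    the point estimate is unchanged, which real CUPED does not guarantee.
    """
    diff = SAVES_MEAN_T - SAVES_MEAN_C
    se = math.sqrt(var_scale * (SAVES_VAR_C / N_C + SAVES_VAR_T / N_T))
    return diff, diff / se


def srm_chi2():
    """Chi-square statistic (1 df) for assignment counts vs. a 50/50 split."""
    expected = (N_C + N_T) / 2
    return ((N_C - expected) ** 2 + (N_T - expected) ** 2) / expected


if __name__ == "__main__":
    ret_diff, ret_z = retention_test()
    sv_diff, sv_z = saves_test()
    _, sv_z_cuped = saves_test(1 - CUPED_VAR_REDUCTION)
    print(f"retention diff={ret_diff:+.4f} z={ret_z:.2f} p={two_sided_p(ret_z):.3g}")
    print(f"saves diff={sv_diff:+.3f} z={sv_z:.2f} z_cuped={sv_z_cuped:.2f}")
    print(f"SRM chi2={srm_chi2():.3f}")
```

Under these assumptions the retention drop of 0.8 percentage points gives a z-statistic of roughly -8.8, and the Saves lift of 0.12 gives roughly 18.8 (about 22.2 with the CUPED-scaled variance). Both effects are far from noise, so the recommendation depends on which metric is treated as the primary goal and which as a guardrail rather than on significance alone. The SRM chi-square from the stated counts comes out near 3.85, very close to the 5% critical value of about 3.84, so it may be worth recomputing the check with the exact assignment counts and the team's usual SRM threshold.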