
You ran a 14-day user-level A/B test on an Instagram Reels ranking change. The treatment was intended to improve the AARRR activation-to-retention path by making Reels more engaging. In control, 520,000 users had a day-7 retention rate of 31.2% (162,240 retained) and averaged 1.84 Instagram Saves per user (variance 10.24). In treatment, 518,000 users had a day-7 retention rate of 30.4% (157,472 retained) and averaged 1.96 Saves per user (variance 10.89). Pre-experiment 7-day Saves per user were available, and a CUPED adjustment reduced the variance of the Saves metric by 28%. A sample ratio mismatch (SRM) check against a 50/50 split was not significant at the 5% level.
How would you analyze these conflicting results statistically and make a recommendation on whether to ship, hold back, or run follow-up experimentation?
{"alpha":0.05,"test_days":14,"control_users":520000,"treatment_users":518000,"control_save_mean":1.84,"treatment_save_mean":1.96,"control_day7_retained":162240,"control_save_variance":10.24,"treatment_day7_retained":157472,"treatment_save_variance":10.89,"cuped_variance_reduction":0.28,"expected_treatment_share":0.5,"control_day7_retention_rate":0.312,"treatment_day7_retention_rate":0.304}
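
One possible starting point for the statistical part of the analysis is sketched below in Python, using only the standard library. It recomputes an SRM statistic from the stated counts, runs a pooled two-proportion z-test on day-7 retention, and runs a large-sample z-test on Saves per user with and without the CUPED variance reduction. The function names are illustrative, not part of the problem. Two assumptions are worth flagging: the CUPED-adjusted difference in means is not given, so the sketch reuses the raw difference of 0.12, and it applies the 28% reduction equally to both arms' variances. Treat it as a sanity check under those assumptions, not as a definitive analysis.

```python
import math

# Inputs copied from the problem statement.
n_c, n_t = 520_000, 518_000
ret_c, ret_t = 162_240, 157_472
mean_c, mean_t = 1.84, 1.96
var_c, var_t = 10.24, 10.89
cuped_reduction = 0.28
expected_share = 0.5
alpha = 0.05


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def srm_z(n_control, n_treatment, share):
    """z-statistic for the treatment count against the expected share.
    Its square equals the 1-df chi-square SRM statistic."""
    n = n_control + n_treatment
    return (n_treatment - n * share) / math.sqrt(n * share * (1 - share))


def two_proportion_z(x_c, n_control, x_t, n_treatment):
    """Pooled two-proportion z-test, treatment minus control."""
    pooled = (x_c + x_t) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    return (x_t / n_treatment - x_c / n_control) / se


def mean_diff_z(m_c, v_c, n_control, m_t, v_t, n_treatment, var_scale=1.0):
    """Large-sample z-test for a difference in means.
    var_scale < 1 models a variance reduction such as CUPED (assumed to
    apply equally to both arms and to leave the point estimate unchanged)."""
    se = math.sqrt(var_scale * (v_c / n_control + v_t / n_treatment))
    return (m_t - m_c) / se


z_srm = srm_z(n_c, n_t, expected_share)
z_retention = two_proportion_z(ret_c, n_c, ret_t, n_t)
z_saves_raw = mean_diff_z(mean_c, var_c, n_c, mean_t, var_t, n_t)
z_saves_cuped = mean_diff_z(mean_c, var_c, n_c, mean_t, var_t, n_t,
                            var_scale=1 - cuped_reduction)

for name, z in [("SRM", z_srm), ("day-7 retention", z_retention),
                ("Saves (raw)", z_saves_raw), ("Saves (CUPED)", z_saves_cuped)]:
    p = two_sided_p(z)
    print(f"{name:16s} z = {z:8.3f}  p = {p:.3g}  significant at {alpha}: {p < alpha}")
```

Under these assumptions the retention z-statistic comes out near -8.8 and the CUPED-adjusted Saves statistic near 22, so both movements are far outside noise and the conflict between them is real rather than a power problem. The SRM chi-square for the stated counts is about 3.85 against a 5% critical value of about 3.84, so the reported non-significant result is borderline and worth re-deriving from the raw assignment logs before relying on either metric.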