You work on a digital product where an A/B test shows a clear improvement in the primary conversion metric for the treatment group. However, when you look at longer-term retention cohorts, the treatment appears worse than control. The team is unsure whether this is a real trade-off, noise from a lagging metric, or an analysis mistake.
How would you interpret an experiment in which the primary metric improved but long-term retention in the treatment cohort got worse? How would you investigate whether the result is causal, and how would you decide whether to ship?
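
One possible first step on the "noise from a lagging metric" question is to put a confidence interval around the retention gap before reading anything into it. The sketch below is only an illustration, not a prescribed method: it assumes retention is a binary outcome per user (retained or not at some fixed horizon), that users were randomized independently, and that a normal approximation is acceptable at the sample sizes involved. The function name `diff_in_proportions` and the counts in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist


def diff_in_proportions(success_t, n_t, success_c, n_c, alpha=0.05):
    """Compare two proportions (treatment minus control).

    Returns (difference, ci_low, ci_high, p_value) using a normal
    approximation. Assumes independent users and a binary outcome.
    """
    p_t = success_t / n_t
    p_c = success_c / n_c
    diff = p_t - p_c

    # Unpooled standard error, used for the confidence interval.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Pooled standard error, used for the two-sided test of "no difference".
    pooled = (success_t + success_c) / (n_t + n_c)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return diff, diff - z_crit * se, diff + z_crit * se, p_value


# Hypothetical counts, for illustration only (not from any real experiment).
conversion = diff_in_proportions(success_t=1150, n_t=10000, success_c=1000, n_c=10000)
retention = diff_in_proportions(success_t=380, n_t=2000, success_c=420, n_c=2000)

for name, (diff, lo, hi, p) in [("conversion", conversion), ("retention", retention)]:
    print(f"{name}: diff={diff:+.4f}, 95% CI=({lo:+.4f}, {hi:+.4f}), p={p:.4f}")
```

If the retention interval comfortably includes zero while the conversion interval does not, the apparent retention drop may simply be underpowered, since older cohorts usually contain far fewer users than the full experiment. This check does not address the other two explanations (a real trade-off or an analysis mistake), which need separate investigation.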