You work on a consumer fintech growth team running an A/B test on a new onboarding flow intended to improve activation. The overall result is modest, but several cuts of the data appear to show larger lifts for certain user groups. You want to learn from heterogeneity without telling an overfit story based on noisy slices.
How would you think about segmenting results in this growth experiment without overfitting the story? How would you decide which segments are credible enough to influence the ship decision versus which should be treated as exploratory follow-up?
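
As one concrete starting point, you might separate segments that survive a multiple-comparisons correction from those that look good only on a raw per-slice test. The sketch below is a minimal illustration, not a prescribed method: it assumes a pooled two-proportion z-test per segment and a Benjamini-Hochberg correction across segments. The segment names, counts, and the helper names `two_proportion_p_value` and `benjamini_hochberg` are all hypothetical.

```python
import math


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a difference in activation rates (pooled z-test)."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (conv_t / n_t - conv_c / n_c) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the names whose p-values pass BH at false discovery rate alpha."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / m:
            cutoff = rank
    return {name for name, _ in ranked[:cutoff]}


# Hypothetical counts: (control activations, control users,
#                       treatment activations, treatment users).
segments = {
    "new_to_credit": (400, 2000, 470, 2000),
    "ios": (900, 4000, 930, 4000),
    "android": (850, 4000, 880, 4000),
    "paid_acquisition": (300, 1500, 330, 1500),
}

p_values = {name: two_proportion_p_value(*counts) for name, counts in segments.items()}
credible = benjamini_hochberg(p_values, alpha=0.05)

for name, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    label = "passes BH" if name in credible else "exploratory"
    print(f"{name}: p={p:.4f} ({label})")
```

A correction like this only addresses noise from looking at many slices. It says nothing about whether a segment was chosen before the test, whether the effect has a plausible mechanism, or whether it replicates, so a passing segment would still be a candidate for the ship discussion rather than a conclusion.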