You’re joining the analytics engineering team at a large e-commerce marketplace (~20M monthly active buyers, 200M+ orders/year). Product leaders want weekly retention and repeat-purchase cohort dashboards that refresh daily and power both executive reporting and experimentation analysis. Today, analysts compute cohorts ad hoc from raw clickstream and orders tables, which is slow, inconsistent (different cohort definitions), and expensive to run.
Explain how you would approach designing a data model for cohort analysis that supports common questions like:
Your answer should cover:
Discuss trade-offs between (a) building a single wide cohort fact table vs (b) a normalized star schema with a cohort dimension + activity fact. Include at least one concrete example query shape you’re optimizing for (e.g., retention matrix) and call out performance considerations (partitioning, clustering, incremental materializations).