Design Cohort Model for Retention

Context

You’re joining the analytics engineering team at a large e-commerce marketplace (~20M monthly active buyers, 200M+ orders/year). Product leaders want weekly retention and repeat-purchase cohort dashboards that refresh daily and power both executive reporting and experimentation analysis. Today, analysts compute cohorts ad hoc from raw clickstream and orders tables, which is slow, inconsistent (different cohort definitions), and expensive to run.

Core Question

Explain how you would approach designing a data model for cohort analysis that supports common questions like:

“For users acquired in a given week, what % placed a second order in week 1, week 2, … week 12?”
“How does retention differ by acquisition channel and device?”
“How do we handle late-arriving events and backfills without breaking historical cohort numbers?”

Your answer should cover:

Grain and keys: What is the primary entity (user, account, device)? What is the grain of the cohort table(s)?
Cohort definition: First app open vs first purchase vs first paid subscription; how you’d store multiple cohort types.
Time indexing: How you’d represent “cohort week” / “age” (e.g., week_number since cohort start) and calendar alignment.
SQL usability: How the model enables simple SQL using joins, aggregations, and window functions.
Data quality & correctness: Deduplication, timezone handling, bot filtering, refunds/cancellations, and late-arriving data.
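
To make the time-indexing point concrete, here is a minimal sketch of two ways to compute a cohort "age" in weeks. The function names (`week_start`, `calendar_week_number`, `rolling_week_number`) and the Monday week start are illustrative assumptions, not part of the question; dates are assumed to be already converted to a single reporting timezone.

```python
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the calendar week containing d (assumed week start)."""
    return d - timedelta(days=d.weekday())


def calendar_week_number(cohort_date: date, activity_date: date) -> int:
    """Weeks between the cohort's calendar week and the activity's calendar week."""
    return (week_start(activity_date) - week_start(cohort_date)).days // 7


def rolling_week_number(cohort_date: date, activity_date: date) -> int:
    """Whole 7-day periods elapsed since the user's own cohort date."""
    return (activity_date - cohort_date).days // 7


# A user acquired on Wednesday 2024-01-03 who orders on Monday 2024-01-08:
cohort = date(2024, 1, 3)
order = date(2024, 1, 8)
calendar_age = calendar_week_number(cohort, order)  # next calendar week
rolling_age = rolling_week_number(cohort, order)    # only 5 days have elapsed
```

The two definitions disagree for this user, which is why a model would usually fix one convention and store the resulting week number rather than leave it to each query.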

Scope Guidance (what the interviewer expects)

Discuss trade-offs between (a) building a single wide cohort fact table vs (b) a normalized star schema with a cohort dimension + activity fact. Include at least one concrete example query shape you’re optimizing for (e.g., retention matrix) and call out performance considerations (partitioning, clustering, incremental materializations).
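
As one possible illustration of the retention-matrix query shape, the sketch below uses Python's built-in `sqlite3` with an in-memory database. The table and column names (`dim_user_cohort`, `fct_user_week_activity`, `week_number`) and the sample rows are hypothetical; it assumes option (b), a cohort dimension joined to a user-week activity fact, and ignores partitioning and incremental loading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical cohort dimension: one row per user per cohort type.
    CREATE TABLE dim_user_cohort (
        user_id     INTEGER,
        cohort_type TEXT,
        cohort_week TEXT,   -- ISO date of the cohort week start
        channel     TEXT,
        PRIMARY KEY (user_id, cohort_type)
    );
    -- Hypothetical activity fact: one row per user per week with a repeat order.
    CREATE TABLE fct_user_week_activity (
        user_id     INTEGER,
        week_number INTEGER,  -- weeks since the user's cohort week
        PRIMARY KEY (user_id, week_number)
    );
    """
)
conn.executemany(
    "INSERT INTO dim_user_cohort VALUES (?, 'first_purchase', ?, ?)",
    [
        (1, "2024-01-01", "paid"), (2, "2024-01-01", "organic"),
        (3, "2024-01-01", "paid"), (4, "2024-01-01", "organic"),
        (5, "2024-01-08", "paid"), (6, "2024-01-08", "organic"),
    ],
)
conn.executemany(
    "INSERT INTO fct_user_week_activity VALUES (?, ?)",
    [(1, 1), (2, 1), (1, 2), (5, 1)],
)

# Retention matrix in long form: one row per (cohort_week, week_number).
RETENTION_SQL = """
    WITH cohort_size AS (
        SELECT cohort_week, COUNT(*) AS users
        FROM dim_user_cohort
        WHERE cohort_type = 'first_purchase'
        GROUP BY cohort_week
    )
    SELECT c.cohort_week,
           a.week_number,
           COUNT(DISTINCT a.user_id) * 1.0 / s.users AS retention
    FROM dim_user_cohort AS c
    JOIN fct_user_week_activity AS a ON a.user_id = c.user_id
    JOIN cohort_size AS s ON s.cohort_week = c.cohort_week
    WHERE c.cohort_type = 'first_purchase'
    GROUP BY c.cohort_week, a.week_number, s.users
    ORDER BY c.cohort_week, a.week_number
"""
rows = conn.execute(RETENTION_SQL).fetchall()
for cohort_week, week_number, retention in rows:
    print(cohort_week, week_number, retention)
```

Slicing by acquisition channel or device would, under the same assumptions, only add the dimension column to the `SELECT`, the `GROUP BY`, and the cohort-size subquery.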
