Interview Guides

Design Cohort Model for Retention | Dataford Interview Questions - Dataford - Ace your Interview

Design Cohort Model for Retention

Medium

SQL & Data Manipulation

Data ModelingJoinsAggregationsSubqueriesData Wrangling

Problem

Context

You’re joining the analytics engineering team at a large e-commerce marketplace (~20M monthly active buyers, 200M+ orders/year). Product leaders want weekly retention and repeat-purchase cohort dashboards that refresh daily and power both executive reporting and experimentation analysis. Today, analysts compute cohorts ad hoc from raw clickstream and orders tables, which is slow, inconsistent (different cohort definitions), and expensive to run.

Core Question

Explain how you would approach designing a data model for cohort analysis that supports common questions like:

“For users acquired in a given week, what % placed a second order in week 1, week 2, … week 12?”
“How does retention differ by acquisition channel and device?”
“How do we handle late-arriving events and backfills without breaking historical cohort numbers?”

Your answer should cover:

Grain and keys: What is the primary entity (user, account, device)? What is the grain of the cohort table(s)?
Cohort definition: First app open vs first purchase vs first paid subscription; how you’d store multiple cohort types.
Time indexing: How you’d represent “cohort week” / “age” (e.g., week_number since cohort start) and calendar alignment.
SQL usability: How the model enables simple SQL using joins, aggregations, and window functions.
Data quality & correctness: Deduplication, timezone handling, bot filtering, refunds/cancellations, and late-arriving data.

Scope Guidance (what the interviewer expects)

Discuss trade-offs between (a) building a single wide cohort fact table vs (b) a normalized star schema with a cohort dimension + activity fact. Include at least one concrete example query shape you’re optimizing for (e.g., retention matrix) and call out performance considerations (partitioning, clustering, incremental materializations).

Key Concepts

Cohort assignment (first-known event)

Cohort analysis starts by assigning each entity (usually user_id) to a cohort based on the earliest qualifying event (e.g., first_purchase_at). Persisting this assignment avoids recomputing MIN(event_time) across massive event tables for every dashboard.

SELECT user_id, MIN(order_ts) AS first_purchase_at
FROM orders
WHERE order_status = 'paid'
GROUP BY user_id;

Cohort period vs activity period (two time axes)

A cohort model must represent both the cohort’s start period (e.g., cohort_week) and the activity period (e.g., activity_week) or an age index (week_number since cohort start). This enables retention curves and matrices without ambiguous date math.

DATE_TRUNC('week', first_purchase_at) AS cohort_week,
DATE_DIFF('week', DATE_TRUNC('week', first_purchase_at), DATE_TRUNC('week', order_ts)) AS cohort_age_week

Fact table grain for retention

A common, SQL-friendly grain is one row per (cohort_key, cohort_age, segment_dims) with measures like active_users, revenue, orders. This pre-aggregation makes dashboards fast and prevents analysts from accidentally changing definitions.

SELECT cohort_week, cohort_age_week, acquisition_channel,
       COUNT(DISTINCT user_id) AS active_users
FROM user_week_activity
GROUP BY 1,2,3;

Incremental builds and late-arriving data

Cohort tables must be rebuildable for a bounded window (e.g., last 30 days of activity) while keeping cohort assignment stable. Use an immutable cohort assignment table plus an incrementally updated activity fact; backfills update only affected partitions.

/* Pseudocode: rebuild only recent activity partitions */
DELETE FROM cohort_retention_fact
WHERE activity_week >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '30 day');

Window functions for retention curves

Window functions help compute denominators (cohort size) and derived metrics (retention_rate) without self-joins. Modeling data so cohort_size is easily accessible (or precomputed) simplifies these calculations.

retention_rate = active_users::DECIMAL
  / FIRST_VALUE(active_users) OVER (PARTITION BY cohort_week ORDER BY cohort_age_week)

Problem

Context

Core Question

Explain how you would approach designing a data model for cohort analysis that supports common questions like:

“For users acquired in a given week, what % placed a second order in week 1, week 2, … week 12?”
“How does retention differ by acquisition channel and device?”
“How do we handle late-arriving events and backfills without breaking historical cohort numbers?”

Your answer should cover:

Grain and keys: What is the primary entity (user, account, device)? What is the grain of the cohort table(s)?
Cohort definition: First app open vs first purchase vs first paid subscription; how you’d store multiple cohort types.
Time indexing: How you’d represent “cohort week” / “age” (e.g., week_number since cohort start) and calendar alignment.
SQL usability: How the model enables simple SQL using joins, aggregations, and window functions.
Data quality & correctness: Deduplication, timezone handling, bot filtering, refunds/cancellations, and late-arriving data.

Scope Guidance (what the interviewer expects)

Key Concepts

Cohort assignment (first-known event)

SELECT user_id, MIN(order_ts) AS first_purchase_at
FROM orders
WHERE order_status = 'paid'
GROUP BY user_id;

Cohort period vs activity period (two time axes)

DATE_TRUNC('week', first_purchase_at) AS cohort_week,
DATE_DIFF('week', DATE_TRUNC('week', first_purchase_at), DATE_TRUNC('week', order_ts)) AS cohort_age_week

Fact table grain for retention

SELECT cohort_week, cohort_age_week, acquisition_channel,
       COUNT(DISTINCT user_id) AS active_users
FROM user_week_activity
GROUP BY 1,2,3;

Incremental builds and late-arriving data

/* Pseudocode: rebuild only recent activity partitions */
DELETE FROM cohort_retention_fact
WHERE activity_week >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '30 day');

Window functions for retention curves

retention_rate = active_users::DECIMAL
  / FIRST_VALUE(active_users) OVER (PARTITION BY cohort_week ORDER BY cohort_age_week)

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Structure Retention Cohort AnalysisMedium ABuild Monthly Retention Cohort TableHard

Assess Durable Growth With CohortsMedium

Next question

Design Cohort Model for Retention

Medium

SQL & Data Manipulation

Data ModelingJoinsAggregationsSubqueriesData Wrangling

Problem

Context

Core Question

Explain how you would approach designing a data model for cohort analysis that supports common questions like:

“For users acquired in a given week, what % placed a second order in week 1, week 2, … week 12?”
“How does retention differ by acquisition channel and device?”
“How do we handle late-arriving events and backfills without breaking historical cohort numbers?”

Your answer should cover:

Grain and keys: What is the primary entity (user, account, device)? What is the grain of the cohort table(s)?
Cohort definition: First app open vs first purchase vs first paid subscription; how you’d store multiple cohort types.
Time indexing: How you’d represent “cohort week” / “age” (e.g., week_number since cohort start) and calendar alignment.
SQL usability: How the model enables simple SQL using joins, aggregations, and window functions.
Data quality & correctness: Deduplication, timezone handling, bot filtering, refunds/cancellations, and late-arriving data.

Scope Guidance (what the interviewer expects)

Key Concepts

Cohort assignment (first-known event)

SELECT user_id, MIN(order_ts) AS first_purchase_at
FROM orders
WHERE order_status = 'paid'
GROUP BY user_id;

Cohort period vs activity period (two time axes)

DATE_TRUNC('week', first_purchase_at) AS cohort_week,
DATE_DIFF('week', DATE_TRUNC('week', first_purchase_at), DATE_TRUNC('week', order_ts)) AS cohort_age_week

Fact table grain for retention

SELECT cohort_week, cohort_age_week, acquisition_channel,
       COUNT(DISTINCT user_id) AS active_users
FROM user_week_activity
GROUP BY 1,2,3;

Incremental builds and late-arriving data

/* Pseudocode: rebuild only recent activity partitions */
DELETE FROM cohort_retention_fact
WHERE activity_week >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '30 day');

Window functions for retention curves

retention_rate = active_users::DECIMAL
  / FIRST_VALUE(active_users) OVER (PARTITION BY cohort_week ORDER BY cohort_age_week)

Problem

Context

Core Question

Explain how you would approach designing a data model for cohort analysis that supports common questions like:

“For users acquired in a given week, what % placed a second order in week 1, week 2, … week 12?”
“How does retention differ by acquisition channel and device?”
“How do we handle late-arriving events and backfills without breaking historical cohort numbers?”

Your answer should cover:

Grain and keys: What is the primary entity (user, account, device)? What is the grain of the cohort table(s)?
Cohort definition: First app open vs first purchase vs first paid subscription; how you’d store multiple cohort types.
Time indexing: How you’d represent “cohort week” / “age” (e.g., week_number since cohort start) and calendar alignment.
SQL usability: How the model enables simple SQL using joins, aggregations, and window functions.
Data quality & correctness: Deduplication, timezone handling, bot filtering, refunds/cancellations, and late-arriving data.

Scope Guidance (what the interviewer expects)

Key Concepts

Cohort assignment (first-known event)

SELECT user_id, MIN(order_ts) AS first_purchase_at
FROM orders
WHERE order_status = 'paid'
GROUP BY user_id;

Cohort period vs activity period (two time axes)

DATE_TRUNC('week', first_purchase_at) AS cohort_week,
DATE_DIFF('week', DATE_TRUNC('week', first_purchase_at), DATE_TRUNC('week', order_ts)) AS cohort_age_week

Fact table grain for retention

SELECT cohort_week, cohort_age_week, acquisition_channel,
       COUNT(DISTINCT user_id) AS active_users
FROM user_week_activity
GROUP BY 1,2,3;

Incremental builds and late-arriving data

/* Pseudocode: rebuild only recent activity partitions */
DELETE FROM cohort_retention_fact
WHERE activity_week >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '30 day');

Window functions for retention curves

retention_rate = active_users::DECIMAL
  / FIRST_VALUE(active_users) OVER (PARTITION BY cohort_week ORDER BY cohort_age_week)

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Structure Retention Cohort AnalysisMedium ABuild Monthly Retention Cohort TableHard

Assess Durable Growth With CohortsMedium

Next question