You’re joining the analytics engineering team at a large e-commerce marketplace (~20M monthly active buyers, 200M+ orders/year). Product leaders want weekly retention and repeat-purchase cohort dashboards that refresh daily and power both executive reporting and experimentation analysis. Today, analysts compute cohorts ad hoc from raw clickstream and orders tables, which is slow, inconsistent (different cohort definitions), and expensive to run.
Explain how you would approach designing a data model for cohort analysis that supports common questions like:
Your answer should cover:
Discuss trade-offs between (a) building a single wide cohort fact table vs (b) a normalized star schema with a cohort dimension + activity fact. Include at least one concrete example query shape you’re optimizing for (e.g., retention matrix) and call out performance considerations (partitioning, clustering, incremental materializations).
Cohort analysis starts by assigning each entity (usually user_id) to a cohort based on the earliest qualifying event (e.g., first_purchase_at). Persisting this assignment avoids recomputing MIN(event_time) across massive event tables for every dashboard.
SELECT user_id, MIN(order_ts) AS first_purchase_at
FROM orders
WHERE order_status = 'paid'
GROUP BY user_id;
A cohort model must represent both the cohort’s start period (e.g., cohort_week) and the activity period (e.g., activity_week) or an age index (week_number since cohort start). This enables retention curves and matrices without ambiguous date math.
DATE_TRUNC('week', first_purchase_at) AS cohort_week,
DATE_DIFF('week', DATE_TRUNC('week', first_purchase_at), DATE_TRUNC('week', order_ts)) AS cohort_age_week
A common, SQL-friendly grain is one row per (cohort_key, cohort_age, segment_dims) with measures like active_users, revenue, orders. This pre-aggregation makes dashboards fast and prevents analysts from accidentally changing definitions.
SELECT cohort_week, cohort_age_week, acquisition_channel,
COUNT(DISTINCT user_id) AS active_users
FROM user_week_activity
GROUP BY 1,2,3;
Cohort tables must be rebuildable for a bounded window (e.g., last 30 days of activity) while keeping cohort assignment stable. Use an immutable cohort assignment table plus an incrementally updated activity fact; backfills update only affected partitions.
/* Pseudocode: rebuild only recent activity partitions */
DELETE FROM cohort_retention_fact
WHERE activity_week >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '30 day');
Window functions help compute denominators (cohort size) and derived metrics (retention_rate) without self-joins. Modeling data so cohort_size is easily accessible (or precomputed) simplifies these calculations.
retention_rate = active_users::DECIMAL
/ FIRST_VALUE(active_users) OVER (PARTITION BY cohort_week ORDER BY cohort_age_week)