Dataford
Interview Guides
Upgrade
All questions/SQL & Data Manipulation/Structure Retention Cohort Analysis

Structure Retention Cohort Analysis

Medium
SQL & Data Manipulation
Asked at 6 companies6AggregationsCTEsGroup ByDate Functions
Also asked at
HuluQuoraHelloFreshAViberCaesars entertainment

Problem

Context

Cohort retention is a common product analytics pattern used to measure how well users return after signup or first activity. Interviewers ask this to evaluate whether you can translate a business metric into a clean SQL workflow.

Core question

Explain how you would structure a SQL analysis to measure retention by cohort. Your answer should cover:

  1. How to define the cohort date for each user
  2. How to identify retained users in later periods
  3. How to calculate the period offset from the cohort start
  4. How to aggregate results into a cohort retention table or matrix
  5. How to handle edge cases such as multiple events in the same period, missing activity, and partial current periods

Scope guidance

Focus on the SQL design and reasoning rather than a single exact query. The interviewer expects a practical explanation of the steps, common table structure, and the main pitfalls that affect correctness.

Key Concepts

Cohort Assignment

The first step is assigning each user to a cohort, usually based on their signup date or first qualifying event. This creates a stable reference point for comparing later activity across users who started in the same period.

SELECT user_id, DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM user_events
GROUP BY user_id;

Retention Event Definition

Retention depends on a clear definition of what counts as returning activity, such as any session, purchase, or app open in a later week or month. The SQL must distinguish the cohort event from subsequent qualifying activity.

CASE WHEN event_type IN ('session', 'purchase') THEN 1 ELSE 0 END

Period Offset Calculation

Retention is measured relative to the cohort start, so each activity row needs a period number like month 0, month 1, or week 2. This lets you compare users across cohorts on the same lifecycle timeline instead of calendar dates.

((EXTRACT(YEAR FROM activity_month) - EXTRACT(YEAR FROM cohort_month)) * 12 + (EXTRACT(MONTH FROM activity_month) - EXTRACT(MONTH FROM cohort_month))) AS month_number

Distinct User Aggregation

Users often generate multiple events in the same period, so retention should usually count distinct users, not raw events. Without deduplication, retention rates can be inflated and inconsistent across cohorts.

COUNT(DISTINCT user_id) AS retained_users

Retention Rate Calculation

After counting retained users by cohort and period, divide by the cohort size to get the retention rate. This normalization is what makes cohorts comparable even when they have different numbers of users.

retained_users::numeric / NULLIF(cohort_size, 0) AS retention_rate

Problem

Context

Cohort retention is a common product analytics pattern used to measure how well users return after signup or first activity. Interviewers ask this to evaluate whether you can translate a business metric into a clean SQL workflow.

Core question

Explain how you would structure a SQL analysis to measure retention by cohort. Your answer should cover:

  1. How to define the cohort date for each user
  2. How to identify retained users in later periods
  3. How to calculate the period offset from the cohort start
  4. How to aggregate results into a cohort retention table or matrix
  5. How to handle edge cases such as multiple events in the same period, missing activity, and partial current periods

Scope guidance

Focus on the SQL design and reasoning rather than a single exact query. The interviewer expects a practical explanation of the steps, common table structure, and the main pitfalls that affect correctness.

Key Concepts

Cohort Assignment

The first step is assigning each user to a cohort, usually based on their signup date or first qualifying event. This creates a stable reference point for comparing later activity across users who started in the same period.

SELECT user_id, DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM user_events
GROUP BY user_id;

Retention Event Definition

Retention depends on a clear definition of what counts as returning activity, such as any session, purchase, or app open in a later week or month. The SQL must distinguish the cohort event from subsequent qualifying activity.

CASE WHEN event_type IN ('session', 'purchase') THEN 1 ELSE 0 END

Period Offset Calculation

Retention is measured relative to the cohort start, so each activity row needs a period number like month 0, month 1, or week 2. This lets you compare users across cohorts on the same lifecycle timeline instead of calendar dates.

((EXTRACT(YEAR FROM activity_month) - EXTRACT(YEAR FROM cohort_month)) * 12 + (EXTRACT(MONTH FROM activity_month) - EXTRACT(MONTH FROM cohort_month))) AS month_number

Distinct User Aggregation

Users often generate multiple events in the same period, so retention should usually count distinct users, not raw events. Without deduplication, retention rates can be inflated and inconsistent across cohorts.

COUNT(DISTINCT user_id) AS retained_users

Retention Rate Calculation

After counting retained users by cohort and period, divide by the cohort size to get the retention rate. This normalization is what makes cohorts comparable even when they have different numbers of users.

retained_users::numeric / NULLIF(cohort_size, 0) AS retention_rate
Your answer
Try one AI text evaluation on us
Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.
0 wordstarget ~200
Up next
IntuitCompare User Retention by CohortMediumABuild Monthly Retention Cohort TableHardSpliceMeasure Retention After Product LaunchMedium
Next question