At NovaCart, monthly retention reporting depends on reliable transaction dates. If transaction records have missing dates, retention metrics can be understated, overstated, or assigned to the wrong month.
Explain how you would clean and prepare a transaction table with missing customer transaction dates for a monthly retention analysis. Your answer should cover:
The interviewer expects a practical SQL-focused explanation, not just a definition. Discuss data quality checks, trade-offs in handling missing dates, and how the cleaned output should support downstream monthly cohort or repeat-activity analysis.
Before calculating retention, you need to measure how many rows have NULL, malformed, or suspicious transaction dates. This determines whether simple exclusion is acceptable or whether a recovery strategy is needed.
SELECT COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE transaction_date IS NULL) AS missing_dates
FROM transactions;
Missing dates should not be filled blindly. A strong answer explains that some rows may be excluded, some may be imputed from trusted fallback fields, and all such decisions should be flagged for auditability.
SELECT transaction_id,
COALESCE(transaction_date, created_at::date) AS cleaned_date,
CASE
WHEN transaction_date IS NULL AND created_at IS NOT NULL THEN 'imputed_from_created_at'
WHEN transaction_date IS NULL THEN 'excluded'
ELSE 'original'
END AS date_status
FROM transactions;
Retention is usually based on whether a customer was active in a month, not on raw transaction rows. After cleaning dates, convert each transaction to a month bucket and deduplicate to one customer-month record.
SELECT DISTINCT customer_id,
DATE_TRUNC('month', cleaned_date)::date AS activity_month
FROM cleaned_transactions;
A retention-ready table should clearly represent customer activity by month, with enough metadata to trace cleaning decisions. This prevents hidden assumptions from affecting cohort calculations later.
SELECT customer_id,
DATE_TRUNC('month', cleaned_date)::date AS activity_month,
MAX(date_status) AS date_status
FROM prepared_transactions
GROUP BY customer_id, DATE_TRUNC('month', cleaned_date)::date;