Interview Guides

Unified Customer 360 Data Model | Dataford Interview Questions - Dataford - Ace your Interview

Unified Customer 360 Data Model

Medium

SQL & Data Manipulation

Data ModelingJoinsAggregationsSubqueriesData Wrangling

Problem

Context

You’re joining a fintech SaaS platform that sells an API-based fraud detection product to mid-market merchants. The company has ~200K business customers, processes tens of millions of API events per day, and bills customers monthly based on a mix of subscription plans + usage overages. Leadership wants a reliable Customer 360 dataset so that Sales can see product adoption, Finance can reconcile invoices, and Risk can monitor unusual usage patterns that could indicate abuse.

Today, the data is fragmented:

Product usage events live in an event stream (high volume, late-arriving events, occasional duplicates).
Billing data lives in a payments system (invoices, line items, credits, refunds).
Customer attributes live in a CRM (account owner, segment, region) and can change over time.

Core Question

How would you approach building a unified data model that combines product usage, billing, and customer data?

In your answer, cover:

Entity design: What are the canonical entities (customer/account, subscription, invoice, usage) and what keys would you use to join them?
Grain & aggregation: What grain would you choose for the unified model (daily customer, monthly customer, customer-product-day, etc.) and why?
SQL implementation approach: How would you structure the SQL (CTEs, staging tables, incremental models) to dedupe events, handle late-arriving usage, and reconcile invoice totals?
Join strategy & data quality: How do you avoid fanout joins, double counting, and mismatched IDs across systems? What checks would you add?
Handling slowly changing customer attributes: How would you model changing CRM fields (segment, owner, region) so historical reporting remains correct?

Scope Guidance (What the interviewer expects)

Discuss trade-offs and failure modes (e.g., usage billed in a different month, credits applied after invoice creation, backfills). Use concrete SQL patterns (pre-aggregation before joins, surrogate keys, effective-dated joins) and explain how you’d validate correctness at scale.

Key Concepts

Choosing the correct grain (fact table design)

A unified model must have an explicit grain (e.g., customer-month) to prevent ambiguity and double counting. Usage is typically event-level, while billing is invoice-level; you almost always need to aggregate one or both to a shared grain before joining.

WITH usage_customer_month AS (
  SELECT customer_id,
         DATE_TRUNC('month', event_ts) AS month,
         COUNT(*) AS api_calls
  FROM usage_events
  GROUP BY 1,2
)
SELECT * FROM usage_customer_month;

Pre-aggregation to avoid fanout joins

Joining raw usage events to invoice line items can multiply rows (fanout), inflating metrics. The safe pattern is to aggregate each source to the target grain first, then join aggregated datasets.

WITH inv AS (
  SELECT customer_id, invoice_id, invoice_month,
         SUM(amount_cents) AS invoice_total_cents
  FROM invoice_line_items
  GROUP BY 1,2,3
),
usage AS (
  SELECT customer_id, invoice_month,
         SUM(billable_units) AS billable_units
  FROM usage_daily
  GROUP BY 1,2
)
SELECT inv.customer_id, inv.invoice_month, inv.invoice_total_cents, usage.billable_units
FROM inv
LEFT JOIN usage
  ON inv.customer_id = usage.customer_id
 AND inv.invoice_month = usage.invoice_month;

Conformed dimensions and identity resolution

Different systems often use different identifiers (crm_account_id vs billing_customer_id). A conformed dimension (mapping table) provides a single canonical customer key and supports many-to-one or historical mappings if needed.

SELECT u.event_id, m.canonical_customer_id
FROM usage_events u
JOIN customer_id_map m
  ON u.billing_customer_id = m.billing_customer_id;

Slowly changing dimensions (SCD Type 2) for customer attributes

If segment/owner changes over time, you need effective-dated records so historical reports reflect the attribute values at the time of the event/invoice. This is commonly implemented as SCD2 with valid_from/valid_to and an effective-dated join.

SELECT f.customer_id, f.month, d.segment
FROM customer_month_fact f
JOIN dim_customer_scd2 d
  ON f.customer_id = d.customer_id
 AND f.month >= d.valid_from
 AND f.month < COALESCE(d.valid_to, DATE '9999-12-31');

Reconciliation and data quality checks in SQL

A unified model must be provably correct: invoice totals should match the sum of line items; usage billed should align with metering rules; duplicates and late-arriving events should be detected. These checks are often implemented as assertions or anomaly queries in the pipeline.

SELECT invoice_id
FROM (
  SELECT invoice_id,
         SUM(line_amount_cents) AS sum_lines,
         MAX(invoice_total_cents) AS header_total
  FROM billing_lines
  GROUP BY 1
) t
WHERE sum_lines <> header_total;

Problem

Context

Today, the data is fragmented:

Product usage events live in an event stream (high volume, late-arriving events, occasional duplicates).
Billing data lives in a payments system (invoices, line items, credits, refunds).
Customer attributes live in a CRM (account owner, segment, region) and can change over time.

Core Question

How would you approach building a unified data model that combines product usage, billing, and customer data?

In your answer, cover:

Entity design: What are the canonical entities (customer/account, subscription, invoice, usage) and what keys would you use to join them?
Grain & aggregation: What grain would you choose for the unified model (daily customer, monthly customer, customer-product-day, etc.) and why?
SQL implementation approach: How would you structure the SQL (CTEs, staging tables, incremental models) to dedupe events, handle late-arriving usage, and reconcile invoice totals?
Join strategy & data quality: How do you avoid fanout joins, double counting, and mismatched IDs across systems? What checks would you add?
Handling slowly changing customer attributes: How would you model changing CRM fields (segment, owner, region) so historical reporting remains correct?

Scope Guidance (What the interviewer expects)

Key Concepts

Choosing the correct grain (fact table design)

WITH usage_customer_month AS (
  SELECT customer_id,
         DATE_TRUNC('month', event_ts) AS month,
         COUNT(*) AS api_calls
  FROM usage_events
  GROUP BY 1,2
)
SELECT * FROM usage_customer_month;

Pre-aggregation to avoid fanout joins

Joining raw usage events to invoice line items can multiply rows (fanout), inflating metrics. The safe pattern is to aggregate each source to the target grain first, then join aggregated datasets.

WITH inv AS (
  SELECT customer_id, invoice_id, invoice_month,
         SUM(amount_cents) AS invoice_total_cents
  FROM invoice_line_items
  GROUP BY 1,2,3
),
usage AS (
  SELECT customer_id, invoice_month,
         SUM(billable_units) AS billable_units
  FROM usage_daily
  GROUP BY 1,2
)
SELECT inv.customer_id, inv.invoice_month, inv.invoice_total_cents, usage.billable_units
FROM inv
LEFT JOIN usage
  ON inv.customer_id = usage.customer_id
 AND inv.invoice_month = usage.invoice_month;

Conformed dimensions and identity resolution

SELECT u.event_id, m.canonical_customer_id
FROM usage_events u
JOIN customer_id_map m
  ON u.billing_customer_id = m.billing_customer_id;

Slowly changing dimensions (SCD Type 2) for customer attributes

SELECT f.customer_id, f.month, d.segment
FROM customer_month_fact f
JOIN dim_customer_scd2 d
  ON f.customer_id = d.customer_id
 AND f.month >= d.valid_from
 AND f.month < COALESCE(d.valid_to, DATE '9999-12-31');

Reconciliation and data quality checks in SQL

SELECT invoice_id
FROM (
  SELECT invoice_id,
         SUM(line_amount_cents) AS sum_lines,
         MAX(invoice_total_cents) AS header_total
  FROM billing_lines
  GROUP BY 1
) t
WHERE sum_lines <> header_total;

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Build SCD2 Customer Dimension in SnowflakeHard

Next question

Unified Customer 360 Data Model

Medium

SQL & Data Manipulation

Data ModelingJoinsAggregationsSubqueriesData Wrangling

Problem

Context

Today, the data is fragmented:

Product usage events live in an event stream (high volume, late-arriving events, occasional duplicates).
Billing data lives in a payments system (invoices, line items, credits, refunds).
Customer attributes live in a CRM (account owner, segment, region) and can change over time.

Core Question

How would you approach building a unified data model that combines product usage, billing, and customer data?

In your answer, cover:

Entity design: What are the canonical entities (customer/account, subscription, invoice, usage) and what keys would you use to join them?
Grain & aggregation: What grain would you choose for the unified model (daily customer, monthly customer, customer-product-day, etc.) and why?
SQL implementation approach: How would you structure the SQL (CTEs, staging tables, incremental models) to dedupe events, handle late-arriving usage, and reconcile invoice totals?
Join strategy & data quality: How do you avoid fanout joins, double counting, and mismatched IDs across systems? What checks would you add?
Handling slowly changing customer attributes: How would you model changing CRM fields (segment, owner, region) so historical reporting remains correct?

Scope Guidance (What the interviewer expects)

Key Concepts

Choosing the correct grain (fact table design)

WITH usage_customer_month AS (
  SELECT customer_id,
         DATE_TRUNC('month', event_ts) AS month,
         COUNT(*) AS api_calls
  FROM usage_events
  GROUP BY 1,2
)
SELECT * FROM usage_customer_month;

Pre-aggregation to avoid fanout joins

Joining raw usage events to invoice line items can multiply rows (fanout), inflating metrics. The safe pattern is to aggregate each source to the target grain first, then join aggregated datasets.

WITH inv AS (
  SELECT customer_id, invoice_id, invoice_month,
         SUM(amount_cents) AS invoice_total_cents
  FROM invoice_line_items
  GROUP BY 1,2,3
),
usage AS (
  SELECT customer_id, invoice_month,
         SUM(billable_units) AS billable_units
  FROM usage_daily
  GROUP BY 1,2
)
SELECT inv.customer_id, inv.invoice_month, inv.invoice_total_cents, usage.billable_units
FROM inv
LEFT JOIN usage
  ON inv.customer_id = usage.customer_id
 AND inv.invoice_month = usage.invoice_month;

Conformed dimensions and identity resolution

SELECT u.event_id, m.canonical_customer_id
FROM usage_events u
JOIN customer_id_map m
  ON u.billing_customer_id = m.billing_customer_id;

Slowly changing dimensions (SCD Type 2) for customer attributes

SELECT f.customer_id, f.month, d.segment
FROM customer_month_fact f
JOIN dim_customer_scd2 d
  ON f.customer_id = d.customer_id
 AND f.month >= d.valid_from
 AND f.month < COALESCE(d.valid_to, DATE '9999-12-31');

Reconciliation and data quality checks in SQL

SELECT invoice_id
FROM (
  SELECT invoice_id,
         SUM(line_amount_cents) AS sum_lines,
         MAX(invoice_total_cents) AS header_total
  FROM billing_lines
  GROUP BY 1
) t
WHERE sum_lines <> header_total;

Problem

Context

Today, the data is fragmented:

Product usage events live in an event stream (high volume, late-arriving events, occasional duplicates).
Billing data lives in a payments system (invoices, line items, credits, refunds).
Customer attributes live in a CRM (account owner, segment, region) and can change over time.

Core Question

How would you approach building a unified data model that combines product usage, billing, and customer data?

In your answer, cover:

Entity design: What are the canonical entities (customer/account, subscription, invoice, usage) and what keys would you use to join them?
Grain & aggregation: What grain would you choose for the unified model (daily customer, monthly customer, customer-product-day, etc.) and why?
SQL implementation approach: How would you structure the SQL (CTEs, staging tables, incremental models) to dedupe events, handle late-arriving usage, and reconcile invoice totals?
Join strategy & data quality: How do you avoid fanout joins, double counting, and mismatched IDs across systems? What checks would you add?
Handling slowly changing customer attributes: How would you model changing CRM fields (segment, owner, region) so historical reporting remains correct?

Scope Guidance (What the interviewer expects)

Key Concepts

Choosing the correct grain (fact table design)

WITH usage_customer_month AS (
  SELECT customer_id,
         DATE_TRUNC('month', event_ts) AS month,
         COUNT(*) AS api_calls
  FROM usage_events
  GROUP BY 1,2
)
SELECT * FROM usage_customer_month;

Pre-aggregation to avoid fanout joins

Joining raw usage events to invoice line items can multiply rows (fanout), inflating metrics. The safe pattern is to aggregate each source to the target grain first, then join aggregated datasets.

WITH inv AS (
  SELECT customer_id, invoice_id, invoice_month,
         SUM(amount_cents) AS invoice_total_cents
  FROM invoice_line_items
  GROUP BY 1,2,3
),
usage AS (
  SELECT customer_id, invoice_month,
         SUM(billable_units) AS billable_units
  FROM usage_daily
  GROUP BY 1,2
)
SELECT inv.customer_id, inv.invoice_month, inv.invoice_total_cents, usage.billable_units
FROM inv
LEFT JOIN usage
  ON inv.customer_id = usage.customer_id
 AND inv.invoice_month = usage.invoice_month;

Conformed dimensions and identity resolution

SELECT u.event_id, m.canonical_customer_id
FROM usage_events u
JOIN customer_id_map m
  ON u.billing_customer_id = m.billing_customer_id;

Slowly changing dimensions (SCD Type 2) for customer attributes

SELECT f.customer_id, f.month, d.segment
FROM customer_month_fact f
JOIN dim_customer_scd2 d
  ON f.customer_id = d.customer_id
 AND f.month >= d.valid_from
 AND f.month < COALESCE(d.valid_to, DATE '9999-12-31');

Reconciliation and data quality checks in SQL

SELECT invoice_id
FROM (
  SELECT invoice_id,
         SUM(line_amount_cents) AS sum_lines,
         MAX(invoice_total_cents) AS header_total
  FROM billing_lines
  GROUP BY 1
) t
WHERE sum_lines <> header_total;

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Build SCD2 Customer Dimension in SnowflakeHard

Next question