You’re joining a fintech SaaS platform that sells an API-based fraud detection product to mid-market merchants. The company has ~200K business customers, processes tens of millions of API events per day, and bills customers monthly based on a mix of subscription plans + usage overages. Leadership wants a reliable Customer 360 dataset so that Sales can see product adoption, Finance can reconcile invoices, and Risk can monitor unusual usage patterns that could indicate abuse.
Today, the data is fragmented:
How would you approach building a unified data model that combines product usage, billing, and customer data?
In your answer, cover:
Discuss trade-offs and failure modes (e.g., usage billed in a different month, credits applied after invoice creation, backfills). Use concrete SQL patterns (pre-aggregation before joins, surrogate keys, effective-dated joins) and explain how you’d validate correctness at scale.
A unified model must have an explicit grain (e.g., customer-month) to prevent ambiguity and double counting. Usage is typically event-level, while billing is invoice-level; you almost always need to aggregate one or both to a shared grain before joining.
WITH usage_customer_month AS (
SELECT customer_id,
DATE_TRUNC('month', event_ts) AS month,
COUNT(*) AS api_calls
FROM usage_events
GROUP BY 1,2
)
SELECT * FROM usage_customer_month;
Joining raw usage events to invoice line items can multiply rows (fanout), inflating metrics. The safe pattern is to aggregate each source to the target grain first, then join aggregated datasets.
WITH inv AS (
SELECT customer_id, invoice_id, invoice_month,
SUM(amount_cents) AS invoice_total_cents
FROM invoice_line_items
GROUP BY 1,2,3
),
usage AS (
SELECT customer_id, invoice_month,
SUM(billable_units) AS billable_units
FROM usage_daily
GROUP BY 1,2
)
SELECT inv.customer_id, inv.invoice_month, inv.invoice_total_cents, usage.billable_units
FROM inv
LEFT JOIN usage
ON inv.customer_id = usage.customer_id
AND inv.invoice_month = usage.invoice_month;
Different systems often use different identifiers (crm_account_id vs billing_customer_id). A conformed dimension (mapping table) provides a single canonical customer key and supports many-to-one or historical mappings if needed.
SELECT u.event_id, m.canonical_customer_id
FROM usage_events u
JOIN customer_id_map m
ON u.billing_customer_id = m.billing_customer_id;
If segment/owner changes over time, you need effective-dated records so historical reports reflect the attribute values at the time of the event/invoice. This is commonly implemented as SCD2 with valid_from/valid_to and an effective-dated join.
SELECT f.customer_id, f.month, d.segment
FROM customer_month_fact f
JOIN dim_customer_scd2 d
ON f.customer_id = d.customer_id
AND f.month >= d.valid_from
AND f.month < COALESCE(d.valid_to, DATE '9999-12-31');
A unified model must be provably correct: invoice totals should match the sum of line items; usage billed should align with metering rules; duplicates and late-arriving events should be detected. These checks are often implemented as assertions or anomaly queries in the pipeline.
SELECT invoice_id
FROM (
SELECT invoice_id,
SUM(line_amount_cents) AS sum_lines,
MAX(invoice_total_cents) AS header_total
FROM billing_lines
GROUP BY 1
) t
WHERE sum_lines <> header_total;