Context
You’re joining a fintech SaaS platform that sells an API-based fraud detection product to mid-market merchants. The company has ~200K business customers, processes tens of millions of API events per day, and bills customers monthly based on a mix of subscription plans + usage overages. Leadership wants a reliable Customer 360 dataset so that Sales can see product adoption, Finance can reconcile invoices, and Risk can monitor unusual usage patterns that could indicate abuse.
Today, the data is fragmented:
- Product usage events live in an event stream (high volume, late-arriving events, occasional duplicates).
- Billing data lives in a payments system (invoices, line items, credits, refunds).
- Customer attributes live in a CRM (account owner, segment, region) and can change over time.
Core Question
How would you approach building a unified data model that combines product usage, billing, and customer data?
In your answer, cover:
- Entity design: What are the canonical entities (customer/account, subscription, invoice, usage) and what keys would you use to join them?
- Grain & aggregation: What grain would you choose for the unified model (daily customer, monthly customer, customer-product-day, etc.) and why?
- SQL implementation approach: How would you structure the SQL (CTEs, staging tables, incremental models) to dedupe events, handle late-arriving usage, and reconcile invoice totals?
- Join strategy & data quality: How do you avoid fanout joins, double counting, and mismatched IDs across systems? What checks would you add?
- Handling slowly changing customer attributes: How would you model changing CRM fields (segment, owner, region) so historical reporting remains correct?
Scope Guidance (What the interviewer expects)
Discuss trade-offs and failure modes (e.g., usage billed in a different month, credits applied after invoice creation, backfills). Use concrete SQL patterns (pre-aggregation before joins, surrogate keys, effective-dated joins) and explain how you’d validate correctness at scale.