Context
You’re interviewing with the Risk & Growth Analytics team at a large fintech (think “Stripe-scale”) that serves 25M monthly active customers across North America and Europe. The company uses Snowflake as its analytics warehouse and dbt + Airflow for ELT orchestration. Customer attributes (legal name, address, KYC status, risk tier, marketing opt-in, business type) are used by downstream teams for regulatory reporting (KYC/AML), credit decisioning, and lifecycle marketing.
Today, customer attributes are stored in a “latest snapshot” table (customer_current). This causes two major problems:
- Analysts can’t answer time-travel questions like “What risk tier did the customer have when they filed a chargeback?”
- Compliance audits require proving what the company knew at the time (e.g., KYC status as-of a transaction date). A recent audit found gaps, and the business has committed to fixing this within a quarter.
You need to design and implement a Slowly Changing Dimension (SCD) Type 2 pipeline for a dim_customer table in Snowflake, sourced from multiple upstream systems with late-arriving and out-of-order updates.
Data Sources & Current Architecture
Upstream systems (all land into Snowflake RAW schema via existing ingestion):
- Customer Service CRM: updates to name/email/phone; can backfill corrections days later.
- KYC provider: status changes (PENDING → VERIFIED → REJECTED), sometimes replays events.
- Risk engine: risk tier updates (LOW/MED/HIGH) with occasional retroactive corrections.
RAW tables are append-only and partitioned by ingestion time:
raw.crm_customer_updates
raw.kyc_status_events
raw.risk_tier_events
A daily Airflow DAG currently builds analytics.customer_current by taking the latest record per customer from each source and joining them.
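For orientation, that build can be thought of as roughly the following (a simplified sketch; the exact query and the payload column names are assumptions inferred from the schemas in this brief):

```sql
-- Simplified sketch of the current "latest snapshot" build.
create or replace table analytics.customer_current as
with latest_crm as (
    select * from raw.crm_customer_updates
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
),
latest_kyc as (
    select * from raw.kyc_status_events
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
),
latest_risk as (
    select * from raw.risk_tier_events
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
)
select
    c.customer_id,
    c.legal_name,
    c.email,
    c.country,
    k.kyc_status,
    r.risk_tier
from latest_crm c
left join latest_kyc k
  on k.customer_id = c.customer_id
left join latest_risk r
  on r.customer_id = c.customer_id;
```

Because only the latest state survives, the time-travel and audit questions above cannot be answered from this table.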
Scale Requirements
- Customers: ~120M total historical customers, ~25M active monthly
- Change volume: ~8–15M attribute-change events/day across all sources
- Late arrivals: up to 7 days late (rarely 30 days for compliance corrections)
- Freshness: SCD2 dimension must be queryable by analysts by 06:00 UTC daily
- Retention: RAW retained 2 years; SCD2 retained indefinitely
- SLA: 99.5% daily pipeline success; recovery (re-run) within 2 hours
Data Characteristics
You will build analytics.dim_customer with these columns (minimum):
| Column | Type | Notes |
|---|---|---|
| customer_sk | NUMBER | Surrogate key (monotonic sequence or hash) |
| customer_id | STRING | Natural key |
| legal_name | STRING | From CRM |
| email | STRING | From CRM |
| country | STRING | From CRM |
| kyc_status | STRING | From KYC provider |
| risk_tier | STRING | From risk engine |
| effective_from_ts | TIMESTAMP | Start of validity |
| effective_to_ts | TIMESTAMP | End of validity (open-ended for current) |
| is_current | BOOLEAN | Convenience flag |
| record_hash | STRING | Hash of tracked attributes |
| source_max_event_ts | TIMESTAMP | Latest contributing event time |
| loaded_at | TIMESTAMP | Warehouse load time |
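Expressed as Snowflake DDL, the spec above might look like this (a sketch; the sequence-based default and NTZ timestamps are illustrative choices, not mandated by the brief):

```sql
create sequence if not exists analytics.seq_customer_sk;

create table if not exists analytics.dim_customer (
    customer_sk         number        default analytics.seq_customer_sk.nextval,  -- or a hash key
    customer_id         string        not null,                                   -- natural key
    legal_name          string,
    email               string,
    country             string,
    kyc_status          string,
    risk_tier           string,
    effective_from_ts   timestamp_ntz not null,
    effective_to_ts     timestamp_ntz,            -- null (or a far-future date) while current
    is_current          boolean       not null,
    record_hash         string        not null,   -- e.g. md5 over the tracked attributes
    source_max_event_ts timestamp_ntz,
    loaded_at           timestamp_ntz default current_timestamp()
);
```

Note that Snowflake does not enforce uniqueness constraints, so rules such as “one is_current row per customer_id” must be enforced by the pipeline logic and verified by tests.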
Event tables have:
- customer_id
- event_ts (when the upstream system says the change happened)
- ingested_at (when it landed in Snowflake)
- payload fields
- event_id (not always unique across sources; some sources replay)
Known issues:
- Duplicate events (replays)
- Out-of-order events (event_ts not increasing)
- Null/blank overwrites (CRM sometimes sends empty email)
- No customer_id reuse (assume a stable natural key)
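As one illustration of a deterministic way to absorb these issues before SCD2 processing, a staging step for the CRM feed might normalize blanks and deduplicate replays like this (the staging table name and the keep-earliest-ingested rule are assumptions):

```sql
-- Sketch: normalize and deduplicate CRM events ahead of the dimension build.
create or replace table analytics_staging.stg_crm_events as
select
    customer_id,
    event_ts,
    ingested_at,
    event_id,
    legal_name,
    -- blank strings are treated as "no value supplied" so they never overwrite a real email
    nullif(trim(email), '') as email,
    country
from raw.crm_customer_updates
-- deterministic dedup: one row per customer, business timestamp and payload,
-- keeping the earliest-ingested copy of any replayed event
qualify row_number() over (
    partition by
        customer_id,
        event_ts,
        md5(coalesce(legal_name, '') || '|' ||
            coalesce(nullif(trim(email), ''), '') || '|' ||
            coalesce(country, ''))
    order by ingested_at asc
) = 1;
```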
Your Task (What You Must Design)
Design a complete SCD Type 2 implementation for analytics.dim_customer that supports:
Functional requirements
- Type 2 history for tracked attributes: legal_name, email, country, kyc_status, risk_tier.
- Correct handling of late-arriving and out-of-order updates using event_ts (not ingested_at) as the business-effective time.
- Idempotent daily runs: re-running the same Airflow task for a date must not create duplicate dimension rows.
- Deduplication across sources (replayed events) using a deterministic rule.
- Produce a current view (dim_customer_current) and support point-in-time joins (as-of joins) from fact tables; an as-of join sketch follows this list.
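For instance, a point-in-time (as-of) join from a fact onto the dimension could look like this (the fact table and its columns are illustrative):

```sql
-- Sketch: pick the dimension version that was valid at the fact's event time.
select
    f.payment_id,
    f.customer_id,
    f.event_ts,
    d.customer_sk,
    d.kyc_status,
    d.risk_tier
from analytics.fct_payments f
left join analytics.dim_customer d
  on  d.customer_id = f.customer_id
  and f.event_ts >= d.effective_from_ts
  and f.event_ts <  coalesce(d.effective_to_ts, '9999-12-31'::timestamp_ntz);
```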
Non-functional requirements
- Pipeline completes within 90 minutes on typical days.
- Supports backfills of up to 30 days without full rebuild.
- Includes data quality checks and clear failure modes.
- Minimizes Snowflake cost (avoid full-table merges/scans when possible); an incremental-window sketch follows this list.
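One common way to reconcile the backfill and cost requirements is an incremental model that only rescans a trailing ingestion window, widened on demand for backfills. A dbt sketch (the source name, variable name, and 7-day default are assumptions):

```sql
-- models/staging/stg_customer_events.sql (sketch)
{{ config(
    materialized='incremental',
    unique_key='event_key',
    incremental_strategy='merge'
) }}

select
    md5(customer_id || '|' || event_ts::string || '|' || coalesce(event_id, '')) as event_key,
    customer_id,
    event_ts,
    ingested_at,
    legal_name,
    nullif(trim(email), '') as email,
    country
from {{ source('raw', 'crm_customer_updates') }}
{% if is_incremental() %}
  -- only rescan a trailing window on normal runs
  where ingested_at >= dateadd('day', -{{ var('lookback_days', 7) }}, current_timestamp())
{% endif %}
```

A backfill then becomes `dbt run --vars '{"lookback_days": 30}'` rather than a full rebuild, and the merge on `event_key` keeps re-runs from duplicating rows.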
Constraints
- Warehouse: Snowflake Enterprise; you can create streams/tasks but the org standard is Airflow + dbt.
- Team skillset: strong SQL/dbt; limited Spark.
- Compliance: must be able to explain lineage and logic to auditors; keep an audit trail of changes.
- Facts: downstream fact tables (payments, disputes) join by customer_id and event_ts.
Interview Prompts
In your answer, walk through:
- The table design and keys (surrogate key strategy, unique constraints you enforce logically).
- The incremental processing strategy (what window you reprocess daily to capture late arrivals).
- The merge logic for SCD2 (closing prior records, inserting new versions).
- How you handle late events that fall inside already-closed intervals (e.g., an event arrives today with event_ts from 5 days ago).
- How you ensure idempotency and prevent duplicates.
- What data quality checks you implement and what happens on failure (an example dbt test sketch appears after these prompts).
- How you would monitor and tune performance in Snowflake.
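By way of example, one such check written as a dbt singular test (the file and model names are illustrative) fails the build whenever the core SCD2 invariants are violated:

```sql
-- tests/assert_dim_customer_scd2_integrity.sql (sketch)
-- A dbt singular test: any row returned is a violation and fails the run.
with dim as (
    select * from {{ ref('dim_customer') }}
),

multiple_current as (
    -- at most one open (is_current) version per natural key
    select customer_id
    from dim
    where is_current
    group by customer_id
    having count(*) > 1
),

overlapping as (
    -- validity intervals for the same customer must not overlap
    select distinct a.customer_id
    from dim a
    join dim b
      on  a.customer_id = b.customer_id
      and a.customer_sk <> b.customer_sk
      and a.effective_from_ts < coalesce(b.effective_to_ts, '9999-12-31'::timestamp_ntz)
      and b.effective_from_ts < coalesce(a.effective_to_ts, '9999-12-31'::timestamp_ntz)
)

select customer_id, 'multiple_current_rows' as failure from multiple_current
union all
select customer_id, 'overlapping_intervals' as failure from overlapping
```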
You may assume you can create intermediate staging tables in analytics_staging and that dbt runs daily via Airflow.
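Finally, to make “closing prior records, inserting new versions” concrete, here is one minimal rebuild-and-swap sketch. It assumes a staging model (analytics_staging.stg_customer_changes, an illustrative name) that already holds the full, deduplicated version history for every customer touched in the reprocessing window, one row per (customer_id, effective_from_ts), with a record_hash over the tracked attributes:

```sql
-- Sketch only: rebuild the dimension rows for affected customers, not the whole table.

begin;

-- 1) Remove existing versions for affected customers only.
delete from analytics.dim_customer
where customer_id in (
    select distinct customer_id from analytics_staging.stg_customer_changes
);

-- 2) Re-derive versioned rows; each interval closes at the next version's start.
--    Late or out-of-order events are absorbed because the customer's full history is rebuilt.
insert into analytics.dim_customer (
    customer_id, legal_name, email, country, kyc_status, risk_tier,
    effective_from_ts, effective_to_ts, is_current, record_hash,
    source_max_event_ts, loaded_at
)
select
    customer_id, legal_name, email, country, kyc_status, risk_tier,
    effective_from_ts,
    lead(effective_from_ts) over (partition by customer_id order by effective_from_ts)         as effective_to_ts,
    lead(effective_from_ts) over (partition by customer_id order by effective_from_ts) is null as is_current,
    record_hash,
    source_max_event_ts,
    current_timestamp()
from analytics_staging.stg_customer_changes;

commit;
```

The delete-then-insert pair is naturally idempotent, and a hash-based surrogate key (for example, over customer_id plus effective_from_ts) keeps customer_sk stable across such rebuilds, whereas a sequence-based key would reassign it.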