Context
You’re interviewing with the Risk & Growth Analytics team at a large fintech (think “Stripe-scale”) that serves 25M monthly active customers across North America and Europe. The company uses Snowflake as its analytics warehouse and dbt + Airflow for ELT orchestration. Customer attributes (legal name, address, KYC status, risk tier, marketing opt-in, business type) are used by downstream teams for regulatory reporting (KYC/AML), credit decisioning, and lifecycle marketing.
Today, customer attributes are stored in a “latest snapshot” table (customer_current). This causes two major problems:
- Analysts can’t answer time-travel questions like “What risk tier did the customer have when they filed a chargeback?”
- Compliance audits require proving what the company knew at the time (e.g., KYC status as-of a transaction date). A recent audit found gaps, and the business has committed to fixing this within a quarter.
You need to design and implement a Slowly Changing Dimension (SCD) Type 2 pipeline for a dim_customer table in Snowflake, sourced from multiple upstream systems with late-arriving and out-of-order updates.
Data Sources & Current Architecture
Upstream systems (all land into Snowflake RAW schema via existing ingestion):
- Customer Service CRM: updates to name/email/phone; can backfill corrections days later.
- KYC provider: status changes (PENDING → VERIFIED → REJECTED), sometimes replays events.
- Risk engine: risk tier updates (LOW/MED/HIGH) with occasional retroactive corrections.
RAW tables are append-only and partitioned by ingestion time:
raw.crm_customer_updates
raw.kyc_status_events
raw.risk_tier_events
A daily Airflow DAG currently builds analytics.customer_current by taking the latest record per customer from each source and joining them.
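For orientation, that build can be thought of as roughly the following (a simplified sketch; the exact query and the payload column names are assumptions inferred from the schemas in this brief):

```sql
-- Simplified sketch of the current "latest snapshot" build.
create or replace table analytics.customer_current as
with latest_crm as (
    select * from raw.crm_customer_updates
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
),
latest_kyc as (
    select * from raw.kyc_status_events
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
),
latest_risk as (
    select * from raw.risk_tier_events
    qualify row_number() over (
        partition by customer_id order by event_ts desc, ingested_at desc) = 1
)
select
    c.customer_id,
    c.legal_name,
    c.email,
    c.country,
    k.kyc_status,
    r.risk_tier
from latest_crm c
left join latest_kyc k
  on k.customer_id = c.customer_id
left join latest_risk r
  on r.customer_id = c.customer_id;
```

Because only the latest state survives, the time-travel and audit questions above cannot be answered from this table.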
Scale Requirements
- Customers: ~120M total historical customers, ~25M active monthly
- Change volume: ~8–15M attribute-change events/day across all sources
- Late arrivals: up to 7 days late (rarely 30 days for compliance corrections)
- Freshness: SCD2 dimension must be queryable by analysts by 06:00 UTC daily
- Retention: RAW retained 2 years; SCD2 retained indefinitely
- SLA: 99.5% daily pipeline success; recovery (re-run) within 2 hours
Data Characteristics
You will build analytics.dim_customer with these columns (minimum):
| Column | Type | Notes |
|---|---|---|
| customer_sk | NUMBER | Surrogate key (monotonic sequence or hash) |
| customer_id | STRING | Natural key |
| legal_name | STRING | From CRM |
| email | STRING | From CRM |
| country | STRING | From CRM |
| kyc_status | STRING | From KYC provider |
| risk_tier | STRING | From risk engine |
| effective_from_ts | TIMESTAMP | Start of validity |
| effective_to_ts | TIMESTAMP | End of validity (open-ended for current) |
| is_current | BOOLEAN | Convenience flag |
| record_hash | STRING | Hash of tracked attributes |
| source_max_event_ts | TIMESTAMP | Latest contributing event time |
| loaded_at | TIMESTAMP | Warehouse load time |
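Expressed as Snowflake DDL, the spec above might look like this (a sketch; the sequence-based default and NTZ timestamps are illustrative choices, not mandated by the brief):

```sql
create sequence if not exists analytics.seq_customer_sk;

create table if not exists analytics.dim_customer (
    customer_sk         number        default analytics.seq_customer_sk.nextval,  -- or a hash key
    customer_id         string        not null,                                   -- natural key
    legal_name          string,
    email               string,
    country             string,
    kyc_status          string,
    risk_tier           string,
    effective_from_ts   timestamp_ntz not null,
    effective_to_ts     timestamp_ntz,            -- null (or a far-future date) while current
    is_current          boolean       not null,
    record_hash         string        not null,   -- e.g. md5 over the tracked attributes
    source_max_event_ts timestamp_ntz,
    loaded_at           timestamp_ntz default current_timestamp()
);
```

Note that Snowflake does not enforce uniqueness constraints, so rules such as “one is_current row per customer_id” must be enforced by the pipeline logic and verified by tests.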
Event tables have:
- customer_id
- event_ts (when the upstream system says the change happened)
- ingested_at (when it landed in Snowflake)
- payload fields
- event_id (not always unique across sources; some sources replay)
Known issues:
- Duplicate events (replays)
- Out-of-order events (event_ts not increasing)
- Null/blank overwrites (CRM sometimes sends empty email)
- No customer_id reuse (assume a stable natural key)
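As one illustration of a deterministic way to absorb these issues before SCD2 processing, a staging step for the CRM feed might normalize blanks and deduplicate replays like this (the staging table name and the keep-earliest-ingested rule are assumptions):

```sql
-- Sketch: normalize and deduplicate CRM events ahead of the dimension build.
create or replace table analytics_staging.stg_crm_events as
select
    customer_id,
    event_ts,
    ingested_at,
    event_id,
    legal_name,
    -- blank strings are treated as "no value supplied" so they never overwrite a real email
    nullif(trim(email), '') as email,
    country
from raw.crm_customer_updates
-- deterministic dedup: one row per customer, business timestamp and payload,
-- keeping the earliest-ingested copy of any replayed event
qualify row_number() over (
    partition by
        customer_id,
        event_ts,
        md5(coalesce(legal_name, '') || '|' ||
            coalesce(nullif(trim(email), ''), '') || '|' ||
            coalesce(country, ''))
    order by ingested_at asc
) = 1;
```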
Your Task (What You Must Design)
Design a complete SCD Type 2 implementation for analytics.dim_customer that supports:
Functional requirements
- Type 2 history for tracked attributes: legal_name, email, country, kyc_status, risk_tier.
- Correct handling of late-arriving and out-of-order updates using event_ts (not ingested_at) as the business-effective time.
- Idempotent daily runs: re-running the same Airflow task for a date must not create duplicate dimension rows.
- Deduplication across sources (replayed events) using a deterministic rule.
- Produce a current view (dim_customer_current) and support point-in-time joins (as-of joins) from fact tables; an as-of join sketch follows this list.
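For instance, a point-in-time (as-of) join from a fact onto the dimension could look like this (the fact table and its columns are illustrative):

```sql
-- Sketch: pick the dimension version that was valid at the fact's event time.
select
    f.payment_id,
    f.customer_id,
    f.event_ts,
    d.customer_sk,
    d.kyc_status,
    d.risk_tier
from analytics.fct_payments f
left join analytics.dim_customer d
  on  d.customer_id = f.customer_id
  and f.event_ts >= d.effective_from_ts
  and f.event_ts <  coalesce(d.effective_to_ts, '9999-12-31'::timestamp_ntz);
```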
Non-functional requirements
- Pipeline completes within 90 minutes on typical days.
- Supports backfills of up to 30 days without full rebuild.
- Includes data quality checks and clear failure modes.
- Minimizes Snowflake cost (avoid full-table merges/scans when possible); an incremental-window sketch follows this list.
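One common way to reconcile the backfill and cost requirements is an incremental model that only rescans a trailing ingestion window, widened on demand for backfills. A dbt sketch (the source name, variable name, and 7-day default are assumptions):

```sql
-- models/staging/stg_customer_events.sql (sketch)
{{ config(
    materialized='incremental',
    unique_key='event_key',
    incremental_strategy='merge'
) }}

select
    md5(customer_id || '|' || event_ts::string || '|' || coalesce(event_id, '')) as event_key,
    customer_id,
    event_ts,
    ingested_at,
    legal_name,
    nullif(trim(email), '') as email,
    country
from {{ source('raw', 'crm_customer_updates') }}
{% if is_incremental() %}
  -- only rescan a trailing window on normal runs
  where ingested_at >= dateadd('day', -{{ var('lookback_days', 7) }}, current_timestamp())
{% endif %}
```

A backfill then becomes `dbt run --vars '{"lookback_days": 30}'` rather than a full rebuild, and the merge on `event_key` keeps re-runs from duplicating rows.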
Constraints
- Warehouse: Snowflake Enterprise; you can create streams/tasks but the org standard is Airflow + dbt.
- Team skillset: strong SQL/dbt; limited Spark.
- Compliance: must be able to explain lineage and logic to auditors; keep an audit trail of changes.
- Facts: downstream fact tables (payments, disputes) join by customer_id and event_ts.
Interview Prompts
In your answer, walk through:
- The table design and keys (surrogate key strategy, unique constraints you enforce logically).
- The incremental processing strategy (what window you reprocess daily to capture late arrivals).
- The merge logic for SCD2 (closing prior records, inserting new versions).
- How you handle late events that fall inside already-closed intervals (e.g., an event arrives today with event_ts from 5 days ago).
- How you ensure idempotency and prevent duplicates.
- What data quality checks you implement and what happens on failure (an example dbt test sketch appears after these prompts).
- How you would monitor and tune performance in Snowflake.
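By way of example, one such check written as a dbt singular test (the file and model names are illustrative) fails the build whenever the core SCD2 invariants are violated:

```sql
-- tests/assert_dim_customer_scd2_integrity.sql (sketch)
-- A dbt singular test: any row returned is a violation and fails the run.
with dim as (
    select * from {{ ref('dim_customer') }}
),

multiple_current as (
    -- at most one open (is_current) version per natural key
    select customer_id
    from dim
    where is_current
    group by customer_id
    having count(*) > 1
),

overlapping as (
    -- validity intervals for the same customer must not overlap
    select distinct a.customer_id
    from dim a
    join dim b
      on  a.customer_id = b.customer_id
      and a.customer_sk <> b.customer_sk
      and a.effective_from_ts < coalesce(b.effective_to_ts, '9999-12-31'::timestamp_ntz)
      and b.effective_from_ts < coalesce(a.effective_to_ts, '9999-12-31'::timestamp_ntz)
)

select customer_id, 'multiple_current_rows' as failure from multiple_current
union all
select customer_id, 'overlapping_intervals' as failure from overlapping
```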
You may assume you can create intermediate staging tables in analytics_staging and that dbt runs daily via Airflow.
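Finally, to make “closing prior records, inserting new versions” concrete, here is one minimal rebuild-and-swap sketch. It assumes a staging model (analytics_staging.stg_customer_changes, an illustrative name) that already holds the full, deduplicated version history for every customer touched in the reprocessing window, one row per (customer_id, effective_from_ts), with a record_hash over the tracked attributes:

```sql
-- Sketch only: rebuild the dimension rows for affected customers, not the whole table.

begin;

-- 1) Remove existing versions for affected customers only.
delete from analytics.dim_customer
where customer_id in (
    select distinct customer_id from analytics_staging.stg_customer_changes
);

-- 2) Re-derive versioned rows; each interval closes at the next version's start.
--    Late or out-of-order events are absorbed because the customer's full history is rebuilt.
insert into analytics.dim_customer (
    customer_id, legal_name, email, country, kyc_status, risk_tier,
    effective_from_ts, effective_to_ts, is_current, record_hash,
    source_max_event_ts, loaded_at
)
select
    customer_id, legal_name, email, country, kyc_status, risk_tier,
    effective_from_ts,
    lead(effective_from_ts) over (partition by customer_id order by effective_from_ts)         as effective_to_ts,
    lead(effective_from_ts) over (partition by customer_id order by effective_from_ts) is null as is_current,
    record_hash,
    source_max_event_ts,
    current_timestamp()
from analytics_staging.stg_customer_changes;

commit;
```

The delete-then-insert pair is naturally idempotent, and a hash-based surrogate key (for example, over customer_id plus effective_from_ts) keeps customer_sk stable across such rebuilds, whereas a sequence-based key would reassign it.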