Context
You’re joining the Risk & Growth Analytics org at a mid-stage fintech that offers consumer credit card and BNPL (buy now, pay later) products in the US. The company has 12M cardholders, ~1.8M DAUs, and processes $4B/month in purchase volume. Decisions like credit line increases, fraud monitoring, and marketing attribution depend on accurate, fresh data.
Today, the analytics stack is a mix of bespoke Python ETL scripts, ad-hoc SQL, and a few vendor-managed connectors. Data lands inconsistently in Snowflake, transformations are hard to reproduce, and metric definitions drift across teams. The VP of Data wants to standardize on a modern data stack: Fivetran for ingestion and dbt for transformations, orchestrated by Airflow, with a strong emphasis on data quality, idempotency, and backfills.
Current state and pain points
- Sources:
- Postgres (core ledger + user accounts), ~3 TB total, high write rate
- Salesforce (support + CRM)
- Stripe (payments)
- Segment (product events)
- Warehouse: Snowflake, but tables are a mix of raw and curated with inconsistent naming.
- Freshness: Some tables update every 6 hours; others are near real-time; no clear SLAs.
- Quality issues:
- Ledger adjustments (chargebacks, reversals) can arrive up to 14 days after the original transaction.
- Duplicate events from Segment retries.
- Schema changes in Salesforce break downstream dashboards.
- Business stakes:
- Incorrect delinquency metrics can trigger regulatory reporting errors.
- Marketing attribution errors can waste $500K+/month in spend.
Your task is to design a complete, production-grade ingestion and transformation pipeline using Fivetran + dbt.
Scale requirements
- Postgres CDC: peak 8K row changes/sec, average 2K/sec
- Segment events: peak 120K events/sec, average 40K/sec, JSON payloads ~1–2 KB
- Salesforce: ~8M objects total, API limits apply; incremental sync required
- Snowflake:
- Raw data retention: 18 months
- Curated dimension + fact models: retained indefinitely
- Latency SLAs:
- Ledger + transactions marts: P95 end-to-end latency (source commit to queryable mart) < 15 minutes
- Marketing attribution marts: P95 end-to-end latency < 60 minutes
- Backfills: must support reprocessing 90 days of data without manual SQL edits (see the sketch below)
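
One common way to satisfy the no-manual-SQL-edits backfill requirement is to drive every incremental filter from a dbt variable via a shared macro. A minimal sketch; `incremental_window` and `backfill_days` are hypothetical names, not an existing dbt feature:

```sql
-- macros/incremental_window.sql (hypothetical macro; names are illustrative)
{% macro incremental_window(ts_column, default_lookback_days=3) %}
    {% if is_incremental() %}
    -- Normal runs re-scan a short overlap window; a 90-day backfill just
    -- widens it from the CLI:  dbt run --vars '{backfill_days: 90}'
    where {{ ts_column }} >= (
        select dateadd(
            'day',
            -{{ var('backfill_days', default_lookback_days) }},
            max({{ ts_column }})
        )
        from {{ this }}
    )
    {% endif %}
{% endmacro %}
```

Each incremental model then calls `{{ incremental_window('updated_at') }}` instead of a hand-written WHERE clause, so backfills never require editing model SQL.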
Data characteristics
Example raw schemas (simplified)
Postgres: ledger_transactions (CDC)
| column | type | notes |
|---|---|---|
| transaction_id | string | PK |
| user_id | string | |
| amount_cents | int | |
| currency | string | USD |
| status | string | authorized/settled/reversed |
| updated_at | timestamp | mutable; late updates |
| event_time | timestamp | business time |
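
The mutable `updated_at` plus 14-day-late adjustments make this table the textbook case for an incremental merge with a lookback window. A minimal dbt sketch, assuming a Fivetran-landed source named `fivetran_postgres.ledger_transactions` (names are illustrative):

```sql
-- models/staging/stg_ledger_transactions.sql (names are illustrative)
{{ config(
    materialized         = 'incremental',
    unique_key           = 'transaction_id',
    incremental_strategy = 'merge'
) }}

select
    transaction_id,
    user_id,
    amount_cents,
    currency,
    status,
    updated_at,
    event_time
from {{ source('fivetran_postgres', 'ledger_transactions') }}

{% if is_incremental() %}
-- Re-scan a 14-day window so late chargebacks/reversals are picked up;
-- the merge on transaction_id overwrites old versions instead of
-- appending them, so totals are never double counted.
where updated_at >= (select dateadd('day', -14, max(updated_at)) from {{ this }})
{% endif %}

-- CDC can deliver several versions of the same row in one batch:
-- keep only the most recent version.
qualify row_number() over (
    partition by transaction_id
    order by updated_at desc
) = 1
```

Downstream fact models should read from this deduplicated staging layer, never from raw CDC output.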
Segment: track events
| field | type | notes |
|---|---|---|
| messageId | string | event id; intended to be unique, but duplicated in practice by retries |
| userId | string | may be null |
| anonymousId | string | fallback |
| event | string | e.g., "Card Activated" |
| timestamp | timestamp | client time; can be skewed |
| receivedAt | timestamp | server receipt time |
| properties | variant/json | nested |
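
Since `messageId` is not reliably unique, deduplication belongs in the first staging model. A sketch, assuming the connector snake_cases the fields above (`message_id`, `received_at`, ...) into a `tracks` table; adjust to whatever actually lands in raw:

```sql
-- models/staging/stg_segment_events.sql (hypothetical names)
{{ config(materialized='incremental', unique_key='message_id') }}

with source as (
    select
        message_id,
        coalesce(user_id, anonymous_id) as user_key,   -- anonymousId fallback
        event                           as event_name,
        timestamp                       as event_time, -- client clock; skewed
        received_at,                                   -- server receipt time
        properties
    from {{ source('segment', 'tracks') }}
    {% if is_incremental() %}
    -- small overlap catches events that land out of order
    where received_at >= (select dateadd('hour', -3, max(received_at)) from {{ this }})
    {% endif %}
)

select *
from source
-- retries reuse message_id: keep the earliest-received copy
qualify row_number() over (partition by message_id order by received_at) = 1
```

The merge on `message_id` also makes re-runs idempotent: a duplicate arriving in a later sync overwrites its earlier copy instead of adding a second row.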
Requirements
Functional requirements
- Ingest Postgres, Salesforce, Stripe, and Segment into Snowflake using Fivetran with clear raw-layer conventions.
- Build a dbt project that produces:
- A ledger fact table suitable for finance and risk
- A user dimension with slowly changing attributes (at least Type 2 for key fields; see the snapshot sketch after this list)
- A marketing attribution mart joining Segment + Stripe + campaigns
- Handle late-arriving updates (up to 14 days) and mutable records without double counting.
- Implement deduplication for Segment events and idempotent incremental loads.
- Provide a backfill strategy for 90 days and a safe way to re-run models (the var-driven window sketched under Scale requirements is one option).
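
For the Type 2 user dimension, dbt snapshots are the stock ELT answer. A sketch assuming a Fivetran-landed `users` table with an `updated_at` cursor; column names are illustrative:

```sql
-- snapshots/users_snapshot.sql (hypothetical table and column names)
{% snapshot users_snapshot %}

{{ config(
    target_schema = 'snapshots',
    unique_key    = 'user_id',
    strategy      = 'timestamp',
    updated_at    = 'updated_at'
) }}

-- dbt adds dbt_valid_from / dbt_valid_to on each change, which is exactly
-- the Type 2 history the user dimension needs.
select
    user_id,
    email,
    credit_line_cents,
    account_status,
    updated_at
from {{ source('fivetran_postgres', 'users') }}

{% endsnapshot %}
```

A `dim_users` model can then derive `is_current` as `dbt_valid_to is null` and expose both current and as-of views.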
Non-functional requirements
- Define data quality checks (dbt tests + custom checks) with clear ownership and escalation paths (see the test sketch after this list).
- Provide observability: freshness, volume anomalies, schema drift, and lineage.
- Ensure cost control in Snowflake (warehouse sizing, incremental models, clustering strategy).
- Support SOX-like auditability: reproducible transformations, version control, and change reviews.
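
Generic tests (`unique`, `not_null`, `accepted_values` on `status`) cover the basics from YAML; anything bespoke fits dbt's singular tests, where any returned row is a failure. A sketch against a hypothetical `fct_ledger_transactions` mart:

```sql
-- tests/assert_no_double_counted_transactions.sql (hypothetical mart name)
-- Fails loudly if the merge/dedup logic ever regresses and a transaction
-- contributes more than one row to the ledger fact.
select
    transaction_id,
    count(*) as version_count
from {{ ref('fct_ledger_transactions') }}
group by transaction_id
having count(*) > 1
```

Tagging tests with `meta` properties in YAML gives each check a named owner for escalation.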
Constraints
- The company is standardized on Snowflake and Airflow 2.x.
- The team has strong SQL skills but limited Spark/Kafka expertise; prefer ELT where possible.
- Budget target: <$60K/month incremental platform cost (Snowflake compute + Fivetran + orchestration).
- Compliance: PII must be masked in non-prod; production access is role-based; deletions must propagate within 72 hours.
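
For the non-prod PII requirement, Snowflake's dynamic masking policies are one fit. A sketch; the role names, table, and column are placeholders for the real RBAC design:

```sql
-- Hypothetical policy: PROD_ANALYST sees cleartext, everyone else a hash.
create or replace masking policy pii_email_mask as (val string)
returns string ->
    case
        when current_role() in ('PROD_ANALYST') then val
        else sha2(val)  -- deterministic hash keeps joins usable in non-prod
    end;

alter table analytics.dim_users
    modify column email
    set masking policy pii_email_mask;
```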
What to deliver (in the interview)
- A target architecture using Fivetran + dbt + Snowflake + Airflow.
- How you would model raw/staging/marts in dbt (naming, schemas, incremental strategy).
- How you’d handle late-arriving data and mutable facts.
- Concrete monitoring/alerting and failure recovery.
- Example dbt model(s) and Airflow orchestration code.