Context
Northstar Retail, a mid-size e-commerce company, loads order, payment, and shipment data from PostgreSQL and third-party APIs into Snowflake for finance and operations reporting. The current pipeline is a collection of cron jobs and ad hoc SQL scripts, which leads to missed runs, duplicate loads, and poor visibility when failures occur.
You are asked to redesign how the team manages this pipeline so that it is reliable, observable, and easy for a small data engineering team to operate.
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), Stripe API, Shippo API, S3 CSV drops from partners
- Volume: ~80M order-related rows/day, ~600 GB raw data/day
- Batch frequency: core facts every 15 minutes; full reconciliation nightly
- Latency target: source-to-warehouse under 20 minutes for incremental loads
- Retention: 2 years in Snowflake, 90 days raw landing in S3
- Team size: 3 data engineers, 1 analytics engineer
Requirements
- Design a managed pipeline for extracting incremental data from PostgreSQL, APIs, and S3 into Snowflake.
- Define orchestration for task dependencies, retries, backfills, and SLA tracking (see the orchestration sketch after this list).
- Ensure idempotent loads so reruns do not create duplicate facts or inconsistent dimensions (see the MERGE sketch after this list).
- Add data quality checks for schema drift, null spikes, row-count anomalies, and reconciliation against source totals (see the quality-check sketch after this list).
- Support nightly backfills for up to 30 days without blocking scheduled production loads.
- Provide operational visibility: run status, freshness, failure root cause, and lineage from source to reporting tables.
- Expose curated tables for finance dashboards and downstream dbt models.
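The brief does not mandate a specific orchestrator, but a minimal sketch helps pin down what "dependencies, retries, backfills, and SLA tracking" look like in practice. The sketch below assumes Apache Airflow 2.x (for example on Amazon MWAA); the DAG name, task names, and S3 bucket are illustrative, not taken from the brief. Historical backfills would be triggered separately (for example via `airflow dags backfill`) and routed to a capacity-limited pool so they cannot starve the 15-minute production schedule.

```python
# Orchestration sketch (assumed: Apache Airflow 2.x). Names are illustrative.
from datetime import datetime, timedelta
from airflow.decorators import dag, task

@dag(
    schedule="*/15 * * * *",           # core facts every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,                     # historical backfills run separately, in their own pool
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=2),
        "sla": timedelta(minutes=20),  # source-to-warehouse latency target
    },
    tags=["orders", "incremental"],
)
def orders_incremental():
    @task
    def extract_postgres_increment() -> str:
        # Pull only rows changed since the last stored high-water mark
        # (CDC or updated_at-based) and land them in S3; return the prefix.
        return "s3://northstar-landing/orders/"  # hypothetical bucket

    @task
    def copy_into_stage(s3_prefix: str) -> None:
        # COPY the landed files from S3 into the Snowflake staging schema.
        ...

    @task
    def merge_into_facts() -> None:
        # Idempotent MERGE from stage into the fact tables (see the MERGE sketch below).
        ...

    staged = copy_into_stage(extract_postgres_increment())
    staged >> merge_into_facts()

orders_incremental()
```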
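One common way to meet the idempotency requirement is to land each incremental batch in a staging table and MERGE it into the fact table on a natural key, so that rerunning a batch after a retry or backfill updates existing rows instead of inserting duplicates. The sketch below assumes the snowflake-connector-python client; the schema, table, and column names (RAW.ORDERS_STAGE, ANALYTICS.FCT_ORDERS, ORDER_ID, UPDATED_AT) are illustrative.

```python
# Idempotent fact load sketch: MERGE on the natural key so reruns are safe.
# Table and column names are assumptions, not taken from the brief.
import snowflake.connector

MERGE_SQL = """
MERGE INTO ANALYTICS.FCT_ORDERS AS tgt
USING RAW.ORDERS_STAGE AS src
    ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED AND src.UPDATED_AT > tgt.UPDATED_AT THEN UPDATE SET
    tgt.ORDER_STATUS = src.ORDER_STATUS,
    tgt.ORDER_TOTAL  = src.ORDER_TOTAL,
    tgt.UPDATED_AT   = src.UPDATED_AT
WHEN NOT MATCHED THEN INSERT (ORDER_ID, ORDER_STATUS, ORDER_TOTAL, UPDATED_AT)
    VALUES (src.ORDER_ID, src.ORDER_STATUS, src.ORDER_TOTAL, src.UPDATED_AT)
"""

def load_orders_batch(conn: snowflake.connector.SnowflakeConnection) -> int:
    """Run the MERGE in one transaction; replaying the same staged batch
    updates rows already applied rather than duplicating them."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        cur.execute(MERGE_SQL)
        rows_affected = cur.rowcount
        cur.execute("COMMIT")
    return rows_affected
```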
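The quality checks could equally be expressed as dbt tests or a dedicated framework; the sketch below only shows the shape of two of them, a row-count anomaly check against a trailing baseline and a null-spike check on a key column, as plain Python against Snowflake. The OPS.LOAD_AUDIT table, thresholds, and column names are assumptions, not part of the brief.

```python
# Quality-check sketch: row-count anomaly and null-spike checks run after each load.
# OPS.LOAD_AUDIT, thresholds, and table/column names are illustrative.
from datetime import date

def check_daily_volume(cur, table: str, loaded_rows: int, tolerance: float = 0.3) -> None:
    """Fail the run if today's row count deviates more than `tolerance`
    from the trailing 7-day average recorded for this table."""
    cur.execute(
        "SELECT AVG(ROW_COUNT) FROM OPS.LOAD_AUDIT "
        "WHERE TABLE_NAME = %s AND LOAD_DATE >= DATEADD(day, -7, CURRENT_DATE)",
        (table,),
    )
    baseline = cur.fetchone()[0]
    if baseline and abs(loaded_rows - baseline) / baseline > tolerance:
        raise ValueError(
            f"{table}: {loaded_rows} rows loaded on {date.today()}, "
            f"baseline {baseline:.0f}, allowed deviation {tolerance:.0%}"
        )

def check_null_spike(cur, table: str, column: str, max_null_ratio: float = 0.02) -> None:
    """Fail if the share of NULLs in a key column exceeds the threshold."""
    cur.execute(
        f"SELECT COUNT_IF({column} IS NULL) / NULLIF(COUNT(*), 0) FROM {table}"
    )
    null_ratio = cur.fetchone()[0] or 0
    if null_ratio > max_null_ratio:
        raise ValueError(
            f"{table}.{column}: {null_ratio:.2%} NULLs exceeds {max_null_ratio:.0%}"
        )
```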
Constraints
- Must use AWS and Snowflake; no self-managed Hadoop cluster
- Incremental cloud budget increase capped at $18K/month
- Finance reports are SOX-relevant, so auditability and reproducibility are required
- Source PostgreSQL cannot tolerate heavy read load during business hours
- Third-party APIs have rate limits: Stripe 100 req/s, Shippo 60 req/s
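Given the last constraint, the extractor needs client-side throttling so API pulls stay under the published limits even when pages are fetched concurrently. A minimal sketch of a per-API token bucket follows; the configured rates sit below the caps to leave headroom, and the request helper is illustrative rather than tied to the Stripe or Shippo SDKs.

```python
# Throttling sketch: a token bucket per API keeps extraction under the
# published rate limits (Stripe 100 req/s, Shippo 60 req/s). Rates and
# endpoint handling are illustrative.
import threading
import time

import requests

class TokenBucket:
    """Allow at most `rate` requests per second, smoothing bursts across threads."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

# Stay comfortably below the published caps to leave headroom for other callers.
STRIPE_BUCKET = TokenBucket(rate=80)   # published limit: 100 req/s
SHIPPO_BUCKET = TokenBucket(rate=45)   # published limit: 60 req/s

def fetch_page(session: requests.Session, url: str, bucket: TokenBucket, params: dict) -> dict:
    """Throttled GET for one page of results; the pagination loop lives in the caller."""
    bucket.acquire()
    resp = session.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
```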