Context
Northstar Retail ingests order events from a transactional PostgreSQL database and partner-delivered CSV files into Snowflake for finance and operations reporting. The current Airflow-managed batch pipeline occasionally re-runs failed tasks and backfills date ranges, introducing duplicate facts and inconsistent aggregates that require manual cleanup.
You need to redesign the pipeline so repeated execution of the same job produces the same final state. The goal is to make ingestion, transformation, and loading idempotent across normal retries, partial failures, and historical backfills.
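A common way to get "same job, same final state" is to derive every identifier from the source data and the extract window rather than from run time. A minimal sketch of that idea; the column names (`order_id`, `line_number`) and the key scheme are illustrative assumptions, not part of the brief:

```python
import hashlib


def business_key(order_id: str, line_number: int) -> str:
    """Deterministic surrogate key: the same source record always hashes
    to the same key, so a replay overwrites instead of duplicating."""
    raw = f"{order_id}|{line_number}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def load_batch_id(source: str, window_start: str, window_end: str) -> str:
    """Batch ID derived from the extract window, not the run timestamp,
    so an Airflow retry of the same window yields the same batch ID."""
    raw = f"{source}|{window_start}|{window_end}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Because both values are pure functions of their inputs, a manual rerun or automated retry regenerates identical keys, which is what lets downstream merges stay idempotent.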
Scale Requirements
- Sources: PostgreSQL CDC extract + 3 partner SFTP CSV feeds
- Volume: 25M order line records/day, 400 GB raw/day
- Peak batch window: 2 hours for hourly loads
- Latency target: Data available in Snowflake within 15 minutes of source extract completion
- Retention: 1 year raw in S3, 7 years curated in Snowflake
- Backfill scope: Up to 180 days without corrupting downstream tables
Requirements
- Design an idempotent ingestion layer that can safely reprocess the same source files or extract windows.
- Ensure duplicate records are not introduced when Airflow retries tasks or when operators manually rerun DAGs.
- Support upserts and deletes for order status changes, cancellations, and refunds.
- Define how to generate and persist deterministic business keys and load batch IDs, and specify the deduplication logic built on them.
- Build warehouse transformations so downstream fact tables remain correct after replay or backfill.
- Include data quality checks, monitoring, and recovery procedures for partial loads.
- Explain where idempotency must be enforced: source extraction, landing, staging, merge, and orchestration.
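Of the enforcement points above, the merge step carries most of the duplicate-prevention work. One possible shape for it is a Snowflake-style MERGE keyed on the deterministic business key, with intra-batch duplicates collapsed first and deletes handled explicitly. The table and column names here (`staging.order_lines`, `curated.fact_order_lines`, `business_key`, `source_updated_at`, `is_deleted`) are hypothetical; this builds the statement only, it does not execute it:

```python
def build_merge_sql(columns: list[str],
                    staging: str = "staging.order_lines",
                    curated: str = "curated.fact_order_lines") -> str:
    """Build an idempotent MERGE: updates and deletes apply in place,
    so replaying the same batch rewrites identical values, never inserts
    duplicates."""
    set_clause = ", ".join(f"tgt.{c} = src.{c}" for c in columns)
    col_list = ", ".join(columns)
    src_list = ", ".join(f"src.{c}" for c in columns)
    return f"""
MERGE INTO {curated} AS tgt
USING (
    -- collapse intra-batch duplicates: latest version per key wins
    SELECT * FROM {staging}
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY business_key ORDER BY source_updated_at DESC) = 1
) AS src
ON tgt.business_key = src.business_key
WHEN MATCHED AND src.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED AND NOT src.is_deleted
    THEN INSERT (business_key, {col_list})
    VALUES (src.business_key, {src_list})
""".strip()
```

Handling cancellations and refunds as `DELETE`/`UPDATE` branches of the same MERGE keeps one auditable code path for all mutation types, rather than separate delete jobs that can race with loads.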
Constraints
- AWS-first environment using existing S3, Airflow, and Snowflake
- Team of 3 data engineers; solution should minimize operational complexity
- No full-table reloads for curated tables larger than 5 TB
- SOX compliance requires auditable load history and reproducible reruns