Context
ShopNow, a mid-size e-commerce marketplace, loads order, payment, and fulfillment data from PostgreSQL and third-party APIs into Snowflake for finance and operations reporting. The current pipeline is a set of daily Python cron jobs that frequently produce duplicate rows, miss late-arriving updates, and require manual reruns after failures.
You are asked to redesign this pipeline so that reporting tables are reliably available by 7:00 AM UTC each day, while preserving historical correctness for refunds, status changes, and backfilled data.
Scale Requirements
- Sources: PostgreSQL orders DB, Stripe API, Shippo API
- Volume: 25M orders total, 350K new/updated orders per day
- Data size: ~180 GB raw/day across all sources
- Latency target: Daily warehouse tables ready by 7:00 AM UTC; incremental loads every 30 minutes for operational dashboards
- Retention: 2 years raw data, 5 years curated finance tables (rough storage math sketched after this list)
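For a rough sense of scale, a back-of-envelope estimate from the figures above (assuming the ~180 GB/day is uncompressed and roughly constant; Snowflake columnar compression and S3 lifecycle tiering will shrink the effective footprint):

```python
# Back-of-envelope sizing from the scale requirements above.
RAW_GB_PER_DAY = 180
RAW_RETENTION_DAYS = 2 * 365        # 2-year raw retention
DAILY_CHANGED_ORDERS = 350_000

raw_retained_tb = RAW_GB_PER_DAY * RAW_RETENTION_DAYS / 1024
print(f"Raw data retained: ~{raw_retained_tb:.0f} TB")            # ~128 TB before compression

# The change volume itself is modest: the cost pressure is raw storage and
# scan time, not the incremental upserts.
print(f"Changed orders per 30-minute load: ~{DAILY_CHANGED_ORDERS // 48:,}")  # ~7,291
```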
Requirements
- Design an ETL/ELT pipeline that ingests incremental changes from all three sources into Snowflake (see the extraction sketch after this list).
- Ensure idempotent reruns so duplicate records are not created after retries or backfills (see the MERGE sketch below).
- Handle late-arriving updates such as refunds, chargebacks, and shipment status changes up to 30 days after the original order (see the watermark and MERGE sketches below).
- Implement data quality checks for schema drift, null primary keys, duplicate business keys, and reconciliation between orders and payments (see the quality-check sketch below).
- Orchestrate dependencies so downstream finance tables run only after raw ingestion and validation succeed (see the DAG sketch below).
- Support backfills for a specific date range without corrupting current production tables (see the backfill sketch below).
- Provide monitoring, alerting, and failure recovery procedures (alerting hooks appear in the DAG sketch below).
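The sketches below illustrate one possible shape for these requirements; every table, column, DAG, and task name is an assumption for illustration, not the actual ShopNow schema or stack. First, incremental extraction from the PostgreSQL orders DB driven by an `updated_at` watermark: because refunds, chargebacks, and status changes bump `updated_at`, late arrivals surface as new rows in the same query, and a small overlap guards against out-of-order commits.

```python
# Sketch: watermark-based incremental extraction from the PostgreSQL orders DB.
# Assumes orders.updated_at is set on every change (inserts, refunds, status
# updates), so late-arriving changes show up as new rows here and are absorbed
# by the idempotent MERGE in the next sketch.
from datetime import datetime, timedelta

import psycopg2  # any PostgreSQL driver works; psycopg2 is assumed here

OVERLAP = timedelta(minutes=15)  # tolerate commits that land slightly out of order

def extract_changed_orders(conn, last_watermark: datetime):
    """Yield every order row changed since the last successful load."""
    with conn.cursor(name="orders_stream") as cur:  # server-side cursor streams rows
        cur.execute(
            """
            SELECT order_id, customer_id, status, amount_cents, currency, updated_at
            FROM orders
            WHERE updated_at >= %s
            ORDER BY updated_at
            """,
            (last_watermark - OVERLAP,),
        )
        yield from cur
```

The overlap means a handful of rows are extracted twice per run; that is harmless because the load step below is idempotent. Stripe and Shippo extractions would follow the same watermark pattern against their list and event endpoints.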
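Idempotency in the warehouse can come from loading each batch into a staging table and applying a last-write-wins MERGE keyed on the business key, so retries, the extraction overlap, and 30-day-late updates all converge to the same final row. A sketch using the Snowflake Python connector, with placeholder table names (`staging.orders_batch`, `curated.orders`):

```python
# Sketch: idempotent upsert from a staged batch into the curated orders table.
# Re-running the same batch leaves the table unchanged, because rows match on
# the business key and only strictly newer versions overwrite existing rows.
import snowflake.connector  # assumed: snowflake-connector-python

MERGE_ORDERS = """
MERGE INTO curated.orders AS tgt
USING (
    -- Deduplicate the batch to one row per business key, newest version wins;
    -- Snowflake rejects a MERGE whose source matches a target row more than once.
    SELECT *
    FROM staging.orders_batch
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
) AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
  tgt.status       = src.status,
  tgt.amount_cents = src.amount_cents,
  tgt.currency     = src.currency,
  tgt.updated_at   = src.updated_at
WHEN NOT MATCHED THEN INSERT
  (order_id, customer_id, status, amount_cents, currency, updated_at)
VALUES
  (src.order_id, src.customer_id, src.status, src.amount_cents, src.currency, src.updated_at)
"""

def merge_orders_batch(conn) -> int:
    """Apply one staged batch; safe to re-run after any failure or retry."""
    with conn.cursor() as cur:
        cur.execute(MERGE_ORDERS)
        return cur.rowcount  # rows inserted or updated by this run
```

If finance also needs point-in-time history (what an order looked like before a refund), the same staged batch can additionally be appended to a versioned history table or captured with dbt snapshots.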
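The quality gates can be plain SQL assertions executed between ingestion and the finance models; each must report zero violations or the run stops. A sketch with placeholder tables (a framework such as dbt tests or Great Expectations would serve the same purpose):

```python
# Sketch: post-load quality assertions that must all pass before finance models
# are built. Each query counts violations; any non-zero result fails the run.
QUALITY_CHECKS = {
    "null_primary_keys": """
        SELECT COUNT(*) FROM curated.orders WHERE order_id IS NULL
    """,
    "duplicate_business_keys": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM curated.orders GROUP BY order_id HAVING COUNT(*) > 1
        )
    """,
    # Orders/payments reconciliation: captured payments must tie back to an order.
    "captured_payments_without_orders": """
        SELECT COUNT(*) FROM curated.payments p
        LEFT JOIN curated.orders o ON o.order_id = p.order_id
        WHERE p.status = 'captured' AND o.order_id IS NULL
    """,
}

def run_quality_checks(conn) -> None:
    """Raise on the first failing check so orchestration halts downstream tasks."""
    with conn.cursor() as cur:
        for name, sql in QUALITY_CHECKS.items():
            cur.execute(sql)
            (violations,) = cur.fetchone()
            if violations:
                raise ValueError(f"quality check failed: {name} ({violations} rows)")
```

Schema drift is cheaper to catch at ingestion time, by diffing each source's incoming column set against an expected schema kept in version control and failing the load on unexpected removals or type changes.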
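Orchestration and alerting: a managed Airflow (for example Amazon MWAA, which fits the existing AWS stack and a three-person team) expresses the "finance only after ingestion and validation" rule as task ordering, with failure callbacks and SLAs feeding the alerting requirement. A sketch with placeholder callables, assuming a recent Airflow 2.x:

```python
# Sketch: Airflow DAG enforcing ingest -> validate -> finance ordering, with a
# failure callback for alerting and an SLA aligned to the 07:00 UTC target.
# All task callables and the notification hook are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_oncall(context):
    """Placeholder alert hook; in practice a Slack/PagerDuty provider operator."""
    print(f"FAILED: {context['task_instance'].task_id} for {context['ds']}")

def ingest_sources(): ...        # extract + stage PostgreSQL, Stripe, Shippo batches
def run_quality_checks(): ...    # the assertions from the previous sketch
def build_finance_models(): ...  # curated finance tables (e.g., via dbt)

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_oncall,
    "sla": timedelta(hours=3),   # surfaces tasks still running near the 07:00 UTC target
}

with DAG(
    dag_id="shopnow_daily_finance",
    start_date=datetime(2024, 1, 1),
    schedule="0 4 * * *",        # start 04:00 UTC to finish well before 07:00 UTC
    catchup=False,               # backfills are triggered deliberately, not implicitly
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_sources", python_callable=ingest_sources)
    validate = PythonOperator(task_id="run_quality_checks", python_callable=run_quality_checks)
    finance = PythonOperator(task_id="build_finance_models", python_callable=build_finance_models)

    # Finance tables build only after ingestion and validation both succeed.
    ingest >> validate >> finance
```

The 30-minute operational loads would be a second, smaller DAG that reuses the same extraction and MERGE steps without the finance models.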
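Backfills for a specific date range can reuse the same last-write-wins MERGE, applied to a zero-copy clone that is swapped into place only after validation passes, so production tables never see a partially rebuilt window. A sketch, assuming retained raw history lives in a `raw.orders_history` table (placeholder name):

```python
# Sketch: rebuild a date range in a zero-copy clone, validate, then atomically
# swap it into place. Production data is untouched until the swap.
def run_backfill(conn, start_date: str, end_date: str) -> None:
    with conn.cursor() as cur:
        # 1. Zero-copy clone of the production table to rebuild against.
        cur.execute("CREATE OR REPLACE TABLE curated.orders_backfill CLONE curated.orders")

        # 2. Re-apply the affected window from retained raw history, using the
        #    same last-write-wins MERGE as the incremental load.
        cur.execute(
            """
            MERGE INTO curated.orders_backfill AS tgt
            USING (
                SELECT *
                FROM raw.orders_history
                WHERE updated_at BETWEEN %(start)s AND %(end)s
                QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
            ) AS src
              ON tgt.order_id = src.order_id
            WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
              tgt.status = src.status, tgt.amount_cents = src.amount_cents,
              tgt.updated_at = src.updated_at
            WHEN NOT MATCHED THEN INSERT (order_id, status, amount_cents, updated_at)
              VALUES (src.order_id, src.status, src.amount_cents, src.updated_at)
            """,
            {"start": start_date, "end": end_date},
        )

        # 3. In the real pipeline, the quality checks from the earlier sketch would
        #    run against the clone here; only then is it swapped into production.
        cur.execute("ALTER TABLE curated.orders SWAP WITH curated.orders_backfill")
```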
Constraints
- Existing stack is AWS-based and Snowflake is already the company warehouse.
- Team size is 3 data engineers; operational complexity should stay moderate.
- Budget allows managed services but not a large always-on Spark cluster.
- Finance reports are SOX-audited, so lineage, reproducibility, and audit logs are required (a minimal audit-log sketch follows this list).
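One lightweight way to meet the audit requirement is an append-only load-audit table written by every run, recording the run ID, source, time window, row counts, and raw-file manifest, so any finance figure can be traced back to the exact batch that produced it. A sketch with a placeholder `ops.load_audit` table:

```python
# Sketch: append-only audit record written after every pipeline run. Table and
# column names (ops.load_audit) are placeholders; raw files stay immutable in S3.
import json
from datetime import datetime, timezone

def record_load_audit(conn, run_id: str, source: str, window_start, window_end,
                      rows_affected: int, raw_file_manifest: list[str]) -> None:
    """Persist what ran, over which window, and which raw files it consumed."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO ops.load_audit
              (run_id, source, window_start, window_end, rows_affected, raw_files, recorded_at)
            SELECT %(run_id)s, %(source)s, %(window_start)s, %(window_end)s,
                   %(rows_affected)s, PARSE_JSON(%(raw_files)s), %(recorded_at)s
            """,
            {
                "run_id": run_id,
                "source": source,
                "window_start": window_start,
                "window_end": window_end,
                "rows_affected": rows_affected,
                "raw_files": json.dumps(raw_file_manifest),
                "recorded_at": datetime.now(timezone.utc),
            },
        )
```

If the curated finance models are built with dbt, its per-run manifest and run-results artifacts can be archived alongside these records to cover lineage and reproducibility.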