Context
ShopNow, a mid-size e-commerce marketplace, loads order, payment, and fulfillment data from PostgreSQL and third-party APIs into Snowflake for finance and operations reporting. The current pipeline is a set of daily Python cron jobs that frequently produce duplicate rows, miss late-arriving updates, and require manual reruns after failures.
You are asked to redesign this pipeline so that reporting tables are reliably available by 7:00 AM UTC each day, while preserving historical correctness for refunds, status changes, and backfilled data.
Scale Requirements
- Sources: PostgreSQL orders DB, Stripe API, Shippo API
- Volume: 25M orders total, 350K new/updated orders per day
- Data size: ~180 GB raw/day across all sources
- Latency target: Daily warehouse tables ready by 7:00 AM UTC; incremental loads every 30 minutes for operational dashboards
- Retention: 2 years raw data, 5 years curated finance tables (rough storage math sketched after this list)
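For a rough sense of scale, a back-of-envelope estimate from the figures above (assuming the ~180 GB/day is uncompressed and roughly constant; Snowflake columnar compression and S3 lifecycle tiering will shrink the effective footprint):

```python
# Back-of-envelope sizing from the scale requirements above.
RAW_GB_PER_DAY = 180
RAW_RETENTION_DAYS = 2 * 365        # 2-year raw retention
DAILY_CHANGED_ORDERS = 350_000

raw_retained_tb = RAW_GB_PER_DAY * RAW_RETENTION_DAYS / 1024
print(f"Raw data retained: ~{raw_retained_tb:.0f} TB")            # ~128 TB before compression

# The change volume itself is modest: the cost pressure is raw storage and
# scan time, not the incremental upserts.
print(f"Changed orders per 30-minute load: ~{DAILY_CHANGED_ORDERS // 48:,}")  # ~7,291
```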
Requirements
- Design an ETL/ELT pipeline that ingests incremental changes from all three sources into Snowflake (see the extraction sketch after this list).
- Ensure idempotent reruns so duplicate records are not created after retries or backfills (see the MERGE sketch below).
- Handle late-arriving updates such as refunds, chargebacks, and shipment status changes up to 30 days after the original order (see the watermark and MERGE sketches below).
- Implement data quality checks for schema drift, null primary keys, duplicate business keys, and reconciliation between orders and payments (see the quality-check sketch below).
- Orchestrate dependencies so downstream finance tables run only after raw ingestion and validation succeed (see the DAG sketch below).
- Support backfills for a specific date range without corrupting current production tables (see the backfill sketch below).
- Provide monitoring, alerting, and failure recovery procedures (alerting hooks appear in the DAG sketch below).
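The sketches below illustrate one possible shape for these requirements; every table, column, DAG, and task name is an assumption for illustration, not the actual ShopNow schema or stack. First, incremental extraction from the PostgreSQL orders DB driven by an `updated_at` watermark: because refunds, chargebacks, and status changes bump `updated_at`, late arrivals surface as new rows in the same query, and a small overlap guards against out-of-order commits.

```python
# Sketch: watermark-based incremental extraction from the PostgreSQL orders DB.
# Assumes orders.updated_at is set on every change (inserts, refunds, status
# updates), so late-arriving changes show up as new rows here and are absorbed
# by the idempotent MERGE in the next sketch.
from datetime import datetime, timedelta

import psycopg2  # any PostgreSQL driver works; psycopg2 is assumed here

OVERLAP = timedelta(minutes=15)  # tolerate commits that land slightly out of order

def extract_changed_orders(conn, last_watermark: datetime):
    """Yield every order row changed since the last successful load."""
    with conn.cursor(name="orders_stream") as cur:  # server-side cursor streams rows
        cur.execute(
            """
            SELECT order_id, customer_id, status, amount_cents, currency, updated_at
            FROM orders
            WHERE updated_at >= %s
            ORDER BY updated_at
            """,
            (last_watermark - OVERLAP,),
        )
        yield from cur
```

The overlap means a handful of rows are extracted twice per run; that is harmless because the load step below is idempotent. Stripe and Shippo extractions would follow the same watermark pattern against their list and event endpoints.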
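Idempotency in the warehouse can come from loading each batch into a staging table and applying a last-write-wins MERGE keyed on the business key, so retries, the extraction overlap, and 30-day-late updates all converge to the same final row. A sketch using the Snowflake Python connector, with placeholder table names (`staging.orders_batch`, `curated.orders`):

```python
# Sketch: idempotent upsert from a staged batch into the curated orders table.
# Re-running the same batch leaves the table unchanged, because rows match on
# the business key and only strictly newer versions overwrite existing rows.
import snowflake.connector  # assumed: snowflake-connector-python

MERGE_ORDERS = """
MERGE INTO curated.orders AS tgt
USING (
    -- Deduplicate the batch to one row per business key, newest version wins;
    -- Snowflake rejects a MERGE whose source matches a target row more than once.
    SELECT *
    FROM staging.orders_batch
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
) AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
  tgt.status       = src.status,
  tgt.amount_cents = src.amount_cents,
  tgt.currency     = src.currency,
  tgt.updated_at   = src.updated_at
WHEN NOT MATCHED THEN INSERT
  (order_id, customer_id, status, amount_cents, currency, updated_at)
VALUES
  (src.order_id, src.customer_id, src.status, src.amount_cents, src.currency, src.updated_at)
"""

def merge_orders_batch(conn) -> int:
    """Apply one staged batch; safe to re-run after any failure or retry."""
    with conn.cursor() as cur:
        cur.execute(MERGE_ORDERS)
        return cur.rowcount  # rows inserted or updated by this run
```

If finance also needs point-in-time history (what an order looked like before a refund), the same staged batch can additionally be appended to a versioned history table or captured with dbt snapshots.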
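The quality gates can be plain SQL assertions executed between ingestion and the finance models; each must report zero violations or the run stops. A sketch with placeholder tables (a framework such as dbt tests or Great Expectations would serve the same purpose):

```python
# Sketch: post-load quality assertions that must all pass before finance models
# are built. Each query counts violations; any non-zero result fails the run.
QUALITY_CHECKS = {
    "null_primary_keys": """
        SELECT COUNT(*) FROM curated.orders WHERE order_id IS NULL
    """,
    "duplicate_business_keys": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM curated.orders GROUP BY order_id HAVING COUNT(*) > 1
        )
    """,
    # Orders/payments reconciliation: captured payments must tie back to an order.
    "captured_payments_without_orders": """
        SELECT COUNT(*) FROM curated.payments p
        LEFT JOIN curated.orders o ON o.order_id = p.order_id
        WHERE p.status = 'captured' AND o.order_id IS NULL
    """,
}

def run_quality_checks(conn) -> None:
    """Raise on the first failing check so orchestration halts downstream tasks."""
    with conn.cursor() as cur:
        for name, sql in QUALITY_CHECKS.items():
            cur.execute(sql)
            (violations,) = cur.fetchone()
            if violations:
                raise ValueError(f"quality check failed: {name} ({violations} rows)")
```

Schema drift is cheaper to catch at ingestion time, by diffing each source's incoming column set against an expected schema kept in version control and failing the load on unexpected removals or type changes.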
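Orchestration and alerting: a managed Airflow (for example Amazon MWAA, which fits the existing AWS stack and a three-person team) expresses the "finance only after ingestion and validation" rule as task ordering, with failure callbacks and SLAs feeding the alerting requirement. A sketch with placeholder callables, assuming a recent Airflow 2.x:

```python
# Sketch: Airflow DAG enforcing ingest -> validate -> finance ordering, with a
# failure callback for alerting and an SLA aligned to the 07:00 UTC target.
# All task callables and the notification hook are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_oncall(context):
    """Placeholder alert hook; in practice a Slack/PagerDuty provider operator."""
    print(f"FAILED: {context['task_instance'].task_id} for {context['ds']}")

def ingest_sources(): ...        # extract + stage PostgreSQL, Stripe, Shippo batches
def run_quality_checks(): ...    # the assertions from the previous sketch
def build_finance_models(): ...  # curated finance tables (e.g., via dbt)

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_oncall,
    "sla": timedelta(hours=3),   # surfaces tasks still running near the 07:00 UTC target
}

with DAG(
    dag_id="shopnow_daily_finance",
    start_date=datetime(2024, 1, 1),
    schedule="0 4 * * *",        # start 04:00 UTC to finish well before 07:00 UTC
    catchup=False,               # backfills are triggered deliberately, not implicitly
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_sources", python_callable=ingest_sources)
    validate = PythonOperator(task_id="run_quality_checks", python_callable=run_quality_checks)
    finance = PythonOperator(task_id="build_finance_models", python_callable=build_finance_models)

    # Finance tables build only after ingestion and validation both succeed.
    ingest >> validate >> finance
```

The 30-minute operational loads would be a second, smaller DAG that reuses the same extraction and MERGE steps without the finance models.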
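Backfills for a specific date range can reuse the same last-write-wins MERGE, applied to a zero-copy clone that is swapped into place only after validation passes, so production tables never see a partially rebuilt window. A sketch, assuming retained raw history lives in a `raw.orders_history` table (placeholder name):

```python
# Sketch: rebuild a date range in a zero-copy clone, validate, then atomically
# swap it into place. Production data is untouched until the swap.
def run_backfill(conn, start_date: str, end_date: str) -> None:
    with conn.cursor() as cur:
        # 1. Zero-copy clone of the production table to rebuild against.
        cur.execute("CREATE OR REPLACE TABLE curated.orders_backfill CLONE curated.orders")

        # 2. Re-apply the affected window from retained raw history, using the
        #    same last-write-wins MERGE as the incremental load.
        cur.execute(
            """
            MERGE INTO curated.orders_backfill AS tgt
            USING (
                SELECT *
                FROM raw.orders_history
                WHERE updated_at BETWEEN %(start)s AND %(end)s
                QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
            ) AS src
              ON tgt.order_id = src.order_id
            WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
              tgt.status = src.status, tgt.amount_cents = src.amount_cents,
              tgt.updated_at = src.updated_at
            WHEN NOT MATCHED THEN INSERT (order_id, status, amount_cents, updated_at)
              VALUES (src.order_id, src.status, src.amount_cents, src.updated_at)
            """,
            {"start": start_date, "end": end_date},
        )

        # 3. In the real pipeline, the quality checks from the earlier sketch would
        #    run against the clone here; only then is it swapped into production.
        cur.execute("ALTER TABLE curated.orders SWAP WITH curated.orders_backfill")
```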
Constraints
- Existing stack is AWS-based and Snowflake is already the company warehouse.
- Team size is 3 data engineers; operational complexity should stay moderate.
- Budget allows managed services but not a large always-on Spark cluster.
- Finance reports are SOX-audited, so lineage, reproducibility, and audit logs are required (a minimal audit-log sketch follows this list).
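One lightweight way to meet the audit requirement is an append-only load-audit table written by every run, recording the run ID, source, time window, row counts, and raw-file manifest, so any finance figure can be traced back to the exact batch that produced it. A sketch with a placeholder `ops.load_audit` table:

```python
# Sketch: append-only audit record written after every pipeline run. Table and
# column names (ops.load_audit) are placeholders; raw files stay immutable in S3.
import json
from datetime import datetime, timezone

def record_load_audit(conn, run_id: str, source: str, window_start, window_end,
                      rows_affected: int, raw_file_manifest: list[str]) -> None:
    """Persist what ran, over which window, and which raw files it consumed."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO ops.load_audit
              (run_id, source, window_start, window_end, rows_affected, raw_files, recorded_at)
            SELECT %(run_id)s, %(source)s, %(window_start)s, %(window_end)s,
                   %(rows_affected)s, PARSE_JSON(%(raw_files)s), %(recorded_at)s
            """,
            {
                "run_id": run_id,
                "source": source,
                "window_start": window_start,
                "window_end": window_end,
                "rows_affected": rows_affected,
                "raw_files": json.dumps(raw_file_manifest),
                "recorded_at": datetime.now(timezone.utc),
            },
        )
```

If the curated finance models are built with dbt, its per-run manifest and run-results artifacts can be archived alongside these records to cover lineage and reproducibility.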