Context
Northstar Retail ingests order events from a transactional PostgreSQL database and partner-delivered CSV files into Snowflake for finance and operations reporting. The current Airflow-managed batch pipeline occasionally re-runs failed tasks and backfills date ranges, introducing duplicate facts and inconsistent aggregates that require manual cleanup.
You need to redesign the pipeline so repeated execution of the same job produces the same final state. The goal is to make ingestion, transformation, and loading idempotent across normal retries, partial failures, and historical backfills.
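A common way to get "same job, same final state" is to derive every identifier from the source data and the extract window rather than from run time. A minimal sketch of that idea; the column names (`order_id`, `line_number`) and the key scheme are illustrative assumptions, not part of the brief:

```python
import hashlib


def business_key(order_id: str, line_number: int) -> str:
    """Deterministic surrogate key: the same source record always hashes
    to the same key, so a replay overwrites instead of duplicating."""
    raw = f"{order_id}|{line_number}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def load_batch_id(source: str, window_start: str, window_end: str) -> str:
    """Batch ID derived from the extract window, not the run timestamp,
    so an Airflow retry of the same window yields the same batch ID."""
    raw = f"{source}|{window_start}|{window_end}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Because both values are pure functions of their inputs, a manual rerun or automated retry regenerates identical keys, which is what lets downstream merges stay idempotent.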
Scale Requirements
- Sources: PostgreSQL CDC extract + 3 partner SFTP CSV feeds
- Volume: 25M order line records/day, 400 GB raw/day
- Peak batch window: 2 hours for hourly loads
- Latency target: Data available in Snowflake within 15 minutes of source extract completion
- Retention: 1 year raw in S3, 7 years curated in Snowflake
- Backfill scope: Up to 180 days without corrupting downstream tables
Requirements
- Design an idempotent ingestion layer that can safely reprocess the same source files or extract windows.
- Ensure duplicate records are not introduced when Airflow retries tasks or when operators manually rerun DAGs.
- Support upserts and deletes for order status changes, cancellations, and refunds.
- Define how to generate and persist deterministic business keys and load batch IDs, and specify the deduplication logic built on them.
- Build warehouse transformations so downstream fact tables remain correct after replay or backfill.
- Include data quality checks, monitoring, and recovery procedures for partial loads.
- Explain where idempotency must be enforced: source extraction, landing, staging, merge, and orchestration.
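Of the enforcement points above, the merge step carries most of the duplicate-prevention work. One possible shape for it is a Snowflake-style MERGE keyed on the deterministic business key, with intra-batch duplicates collapsed first and deletes handled explicitly. The table and column names here (`staging.order_lines`, `curated.fact_order_lines`, `business_key`, `source_updated_at`, `is_deleted`) are hypothetical; this builds the statement only, it does not execute it:

```python
def build_merge_sql(columns: list[str],
                    staging: str = "staging.order_lines",
                    curated: str = "curated.fact_order_lines") -> str:
    """Build an idempotent MERGE: updates and deletes apply in place,
    so replaying the same batch rewrites identical values, never inserts
    duplicates."""
    set_clause = ", ".join(f"tgt.{c} = src.{c}" for c in columns)
    col_list = ", ".join(columns)
    src_list = ", ".join(f"src.{c}" for c in columns)
    return f"""
MERGE INTO {curated} AS tgt
USING (
    -- collapse intra-batch duplicates: latest version per key wins
    SELECT * FROM {staging}
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY business_key ORDER BY source_updated_at DESC) = 1
) AS src
ON tgt.business_key = src.business_key
WHEN MATCHED AND src.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED AND NOT src.is_deleted
    THEN INSERT (business_key, {col_list})
    VALUES (src.business_key, {src_list})
""".strip()
```

Handling cancellations and refunds as `DELETE`/`UPDATE` branches of the same MERGE keeps one auditable code path for all mutation types, rather than separate delete jobs that can race with loads.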
Constraints
- AWS-first environment using existing S3, Airflow, and Snowflake
- Team of 3 data engineers; solution should minimize operational complexity
- No full-table reloads for curated tables larger than 5 TB
- SOX compliance requires auditable load history and reproducible reruns