Context
Northstar Retail, a mid-market e-commerce company, has an orders and fulfillment reporting pipeline built on nightly batch ETL. The current workflow uses Apache Airflow to orchestrate Python jobs that extract from PostgreSQL, payment APIs, and warehouse management exports, then load curated tables into Snowflake. The pipeline regularly misses the 7 a.m. business SLA, produces duplicate orders during retries, and cannot support same-day operational reporting.
You are asked to redesign the workflow so it meets business requirements for freshness, correctness, and recoverability without replacing the entire AWS-based platform.
Scale Requirements
- Sources: PostgreSQL OLTP, Stripe API, S3 CSV exports from 3PL partners
- Volume: 45M order line items/day, 1.2TB raw data/day
- Peak ingest: 8K order events/sec during promotions
- Latency target: curated operational tables available within 15 minutes of source updates
- Batch SLA: full daily financial reconciliation completed by 6:30 a.m. UTC
- Retention: 2 years curated data, 90 days raw landing zone
Requirements
- Design a revised pipeline that supports both incremental near-real-time ingestion and daily reconciliation.
- Ensure idempotent processing so retries do not create duplicate facts or double-count revenue.
- Handle late-arriving updates such as refunds, cancellations, and shipment status changes.
- Implement data quality checks for schema drift, null business keys, duplicate order IDs, and reconciliation mismatches.
- Support historical backfills for 30 days without disrupting current production loads.
- Define orchestration dependencies, retry behavior, and rollback/reprocessing strategy.
- Expose analytics-ready tables for finance, operations, and customer support teams.
Constraints
- Must stay on AWS + Snowflake; no full platform migration.
- Team size: 3 data engineers, 1 analytics engineer.
- Incremental infrastructure budget: <$18K/month.
- PCI-sensitive payment fields cannot be stored in raw logs beyond tokenized values.
- Source PostgreSQL cannot tolerate heavy read load or long-running extraction queries.