Context
Northstar Retail runs a nightly batch pipeline that loads order, payment, and shipment data from PostgreSQL and S3 into Snowflake for finance and operations reporting. Over the last month, the pipeline has failed intermittently, triggered by a schema change and occasional upstream delays, and the daily revenue dashboards have repeatedly missed their 8:00 AM UTC deadline.
The current stack uses Apache Airflow 2.x for orchestration, AWS Glue PySpark jobs for transformations, Amazon S3 as the raw landing zone, and Snowflake as the warehouse. You need to design how you would troubleshoot failures quickly, prevent bad data from reaching downstream tables, and make the pipeline recoverable without manual cleanup.
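A minimal Airflow 2.x DAG skeleton for this kind of nightly batch is sketched below; the DAG id, task names, schedule, and failure callback are illustrative assumptions, not the existing pipeline. Retries plus an on-failure callback are one way to meet the 5-minute failure-detection target, and the DAG-run timeout guards the batch window.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Placeholder alerting hook (e.g. Slack/PagerDuty); context carries the
    # failed task id and run details.
    print(f"Task failed: {context['task_instance'].task_id}")


default_args = {
    "owner": "data-eng",
    "retries": 2,                              # retry transient upstream hiccups
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,  # fires once retries are exhausted
}

with DAG(
    dag_id="nightly_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 1 * * *",                  # batch window opens 1:00 AM UTC
    catchup=False,
    default_args=default_args,
    dagrun_timeout=timedelta(hours=5, minutes=30),  # abort if the window is blown
) as dag:

    def extract_orders(**_):
        ...  # incremental pull from PostgreSQL, keyed on an updated_at watermark

    def transform_orders(**_):
        ...  # run the Glue PySpark job for the business date being processed

    def load_snowflake(**_):
        ...  # load the curated Snowflake tables for the same business date

    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_snowflake", python_callable=load_snowflake)

    extract >> transform >> load
```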
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), S3 partner files (~2,000 files/day)
- Daily volume: 450M order-related records/day, ~1.8 TB raw compressed data
- SLA: Curated finance tables available by 8:00 AM UTC
- Batch window: 1:00 AM-6:30 AM UTC
- Latency target: Failed task detection within 5 minutes; full recovery within 60 minutes
- Retention: Raw data 180 days, curated tables 3 years
Requirements
- Design a batch pipeline that ingests incremental order data and partner files into Snowflake.
- Explain how you would detect and troubleshoot failures caused by schema drift, late-arriving files, duplicate loads, and task dependency issues (a schema drift check is sketched after this list).
- Ensure transformations are idempotent so reruns do not double-count revenue (see the partition-scoped load sketch after this list).
- Add data quality checks for row-count anomalies, null primary keys, duplicate order IDs, and reconciliation against source totals (see the quality-check sketch after this list).
- Define alerting, retry behavior, backfill strategy, and dead-letter handling for bad files (see the quarantine sketch after this list).
- Support partition-level reruns for a single business date without reprocessing the full pipeline (also covered by the partition-scoped load sketch).
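Illustrative sketches for several of these requirements follow; all table, column, bucket, and connection names are assumptions made for the example, not the actual Northstar schema.

A simple schema drift check compares the columns of a newly staged extract against an expected contract before loading, so a changed source schema fails fast instead of loading NULLs. The cursor is assumed to be a snowflake-connector-python cursor, and FINANCE.STG_ORDERS is a hypothetical staging table.

```python
# Minimal schema drift check against Snowflake's information_schema.
# The expected column set and the FINANCE.STG_ORDERS table are illustrative.
EXPECTED_ORDER_COLUMNS = {
    "order_id", "customer_id", "order_ts", "amount", "currency", "business_date",
}


def check_schema_drift(cur) -> None:
    cur.execute(
        "SELECT LOWER(column_name) FROM information_schema.columns "
        "WHERE table_schema = 'FINANCE' AND table_name = 'STG_ORDERS'"
    )
    actual = {row[0] for row in cur.fetchall()}

    missing = EXPECTED_ORDER_COLUMNS - actual
    extra = actual - EXPECTED_ORDER_COLUMNS
    if missing:
        # Fail fast: downstream selects would break or silently load NULLs.
        raise ValueError(f"Schema drift, expected columns missing: {sorted(missing)}")
    if extra:
        # New columns are non-breaking when loads select columns explicitly;
        # surface them for review instead of failing the run.
        print(f"Schema drift, new columns detected: {sorted(extra)}")
```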
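Idempotency and partition-level reruns can both be handled by scoping every write to a single business date and replacing that slice atomically, for example a delete-then-insert (or MERGE) inside one transaction; rerunning the same date then never double-counts revenue. The sketch below assumes the snowflake-connector-python client and hypothetical finance.stg_orders and finance.curated_orders tables.

```python
import snowflake.connector  # assumed client; conn = snowflake.connector.connect(...)


def reload_business_date(conn, business_date: str) -> None:
    # Replace exactly one business date's slice of the curated table, so a
    # rerun or backfill of that date is safe to repeat.
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute(
            "DELETE FROM finance.curated_orders WHERE business_date = %s",
            (business_date,),
        )
        cur.execute(
            """
            INSERT INTO finance.curated_orders
            SELECT * FROM finance.stg_orders WHERE business_date = %s
            """,
            (business_date,),
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```

The same function serves partition-level reruns and SOX-reproducible backfills: re-triggering the DAG for one logical date (e.g. `airflow dags backfill -s 2025-01-02 -e 2025-01-02 nightly_orders_pipeline`) rewrites only that date's slice.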
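Data quality gates can run after staging and before the curated publish; a failed check raises, which stops the Airflow task so nothing bad reaches the finance tables. The specific thresholds, the 1% reconciliation tolerance, and the table names below are illustrative assumptions.

```python
def run_quality_checks(cur, business_date: str, source_row_count: int) -> None:
    # Hard checks: any offending row fails the run.
    checks = {
        "null_primary_keys": (
            "SELECT COUNT(*) FROM finance.stg_orders "
            "WHERE business_date = %s AND order_id IS NULL"
        ),
        "duplicate_order_ids": (
            "SELECT COUNT(*) FROM ("
            "  SELECT order_id FROM finance.stg_orders "
            "  WHERE business_date = %s GROUP BY order_id HAVING COUNT(*) > 1) dup"
        ),
    }
    failures = []
    for name, sql in checks.items():
        cur.execute(sql, (business_date,))
        offending = cur.fetchone()[0]
        if offending > 0:
            failures.append(f"{name}: {offending} offending rows")

    # Reconciliation / row-count anomaly: staged rows should match the source
    # extract within a small tolerance (1% assumed here).
    cur.execute(
        "SELECT COUNT(*) FROM finance.stg_orders WHERE business_date = %s",
        (business_date,),
    )
    staged = cur.fetchone()[0]
    if source_row_count and abs(staged - source_row_count) / source_row_count > 0.01:
        failures.append(f"row_count_mismatch: staged={staged}, source={source_row_count}")

    if failures:
        # Raising aborts the Airflow task before anything is published downstream.
        raise ValueError("Data quality checks failed: " + "; ".join(failures))
```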
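For bad partner files, one dead-letter pattern is to quarantine the object under a dedicated prefix with the failure reason attached, alert, and let the rest of the batch continue; quarantined files can be repaired and replayed later. The bucket layout and prefix below are assumptions.

```python
import boto3


def quarantine_bad_file(bucket: str, key: str, reason: str) -> None:
    # Copy the bad file to a dead-letter prefix with the failure reason as
    # object metadata, then remove it from the landing zone.
    s3 = boto3.client("s3")
    dead_letter_key = f"dead-letter/{key}"
    s3.copy_object(
        Bucket=bucket,
        Key=dead_letter_key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"quarantine-reason": reason},
        MetadataDirective="REPLACE",  # replace metadata so the reason is stored
    )
    s3.delete_object(Bucket=bucket, Key=key)
```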
Constraints
- AWS-first environment; no new streaming platform may be introduced
- Incremental budget cap: $12K/month
- Small team: 3 data engineers, shared DevOps support
- SOX compliance: finance tables require auditability and reproducible reruns