Context
Northstar Retail, a mid-size e-commerce company, loads order, payment, and shipment data from PostgreSQL and third-party APIs into Snowflake for finance and operations reporting. The current pipeline is a collection of cron jobs and ad hoc SQL scripts, which leads to missed runs, duplicate loads, and poor visibility when failures occur.
You are asked to redesign how the team manages this pipeline so that it is reliable, observable, and easy for a small data engineering team to operate.
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), Stripe API, Shippo API, S3 CSV drops from partners
- Volume: ~80M order-related rows/day, ~600 GB raw data/day
- Batch frequency: core facts every 15 minutes; full reconciliation nightly
- Latency target: source-to-warehouse under 20 minutes for incremental loads
- Retention: 2 years in Snowflake, 90 days raw landing in S3
- Team size: 3 data engineers, 1 analytics engineer
Requirements
- Design a managed pipeline for extracting incremental data from PostgreSQL, APIs, and S3 into Snowflake.
- Define orchestration for task dependencies, retries, backfills, and SLA tracking (see the orchestration sketch after this list).
- Ensure idempotent loads so reruns do not create duplicate facts or inconsistent dimensions (see the MERGE sketch after this list).
- Add data quality checks for schema drift, null spikes, row-count anomalies, and reconciliation against source totals (see the quality-check sketch after this list).
- Support nightly backfills for up to 30 days without blocking scheduled production loads.
- Provide operational visibility: run status, freshness, failure root cause, and lineage from source to reporting tables.
- Expose curated tables for finance dashboards and downstream dbt models.
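The brief does not mandate a specific orchestrator, but a minimal sketch helps pin down what "dependencies, retries, backfills, and SLA tracking" look like in practice. The sketch below assumes Apache Airflow 2.x (for example on Amazon MWAA); the DAG name, task names, and S3 bucket are illustrative, not taken from the brief. Historical backfills would be triggered separately (for example via `airflow dags backfill`) and routed to a capacity-limited pool so they cannot starve the 15-minute production schedule.

```python
# Orchestration sketch (assumed: Apache Airflow 2.x). Names are illustrative.
from datetime import datetime, timedelta
from airflow.decorators import dag, task

@dag(
    schedule="*/15 * * * *",           # core facts every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,                     # historical backfills run separately, in their own pool
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=2),
        "sla": timedelta(minutes=20),  # source-to-warehouse latency target
    },
    tags=["orders", "incremental"],
)
def orders_incremental():
    @task
    def extract_postgres_increment() -> str:
        # Pull only rows changed since the last stored high-water mark
        # (CDC or updated_at-based) and land them in S3; return the prefix.
        return "s3://northstar-landing/orders/"  # hypothetical bucket

    @task
    def copy_into_stage(s3_prefix: str) -> None:
        # COPY the landed files from S3 into the Snowflake staging schema.
        ...

    @task
    def merge_into_facts() -> None:
        # Idempotent MERGE from stage into the fact tables (see the MERGE sketch below).
        ...

    staged = copy_into_stage(extract_postgres_increment())
    staged >> merge_into_facts()

orders_incremental()
```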
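One common way to meet the idempotency requirement is to land each incremental batch in a staging table and MERGE it into the fact table on a natural key, so that rerunning a batch after a retry or backfill updates existing rows instead of inserting duplicates. The sketch below assumes the snowflake-connector-python client; the schema, table, and column names (RAW.ORDERS_STAGE, ANALYTICS.FCT_ORDERS, ORDER_ID, UPDATED_AT) are illustrative.

```python
# Idempotent fact load sketch: MERGE on the natural key so reruns are safe.
# Table and column names are assumptions, not taken from the brief.
import snowflake.connector

MERGE_SQL = """
MERGE INTO ANALYTICS.FCT_ORDERS AS tgt
USING RAW.ORDERS_STAGE AS src
    ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED AND src.UPDATED_AT > tgt.UPDATED_AT THEN UPDATE SET
    tgt.ORDER_STATUS = src.ORDER_STATUS,
    tgt.ORDER_TOTAL  = src.ORDER_TOTAL,
    tgt.UPDATED_AT   = src.UPDATED_AT
WHEN NOT MATCHED THEN INSERT (ORDER_ID, ORDER_STATUS, ORDER_TOTAL, UPDATED_AT)
    VALUES (src.ORDER_ID, src.ORDER_STATUS, src.ORDER_TOTAL, src.UPDATED_AT)
"""

def load_orders_batch(conn: snowflake.connector.SnowflakeConnection) -> int:
    """Run the MERGE in one transaction; replaying the same staged batch
    updates rows already applied rather than duplicating them."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        cur.execute(MERGE_SQL)
        rows_affected = cur.rowcount
        cur.execute("COMMIT")
    return rows_affected
```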
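The quality checks could equally be expressed as dbt tests or a dedicated framework; the sketch below only shows the shape of two of them, a row-count anomaly check against a trailing baseline and a null-spike check on a key column, as plain Python against Snowflake. The OPS.LOAD_AUDIT table, thresholds, and column names are assumptions, not part of the brief.

```python
# Quality-check sketch: row-count anomaly and null-spike checks run after each load.
# OPS.LOAD_AUDIT, thresholds, and table/column names are illustrative.
from datetime import date

def check_daily_volume(cur, table: str, loaded_rows: int, tolerance: float = 0.3) -> None:
    """Fail the run if today's row count deviates more than `tolerance`
    from the trailing 7-day average recorded for this table."""
    cur.execute(
        "SELECT AVG(ROW_COUNT) FROM OPS.LOAD_AUDIT "
        "WHERE TABLE_NAME = %s AND LOAD_DATE >= DATEADD(day, -7, CURRENT_DATE)",
        (table,),
    )
    baseline = cur.fetchone()[0]
    if baseline and abs(loaded_rows - baseline) / baseline > tolerance:
        raise ValueError(
            f"{table}: {loaded_rows} rows loaded on {date.today()}, "
            f"baseline {baseline:.0f}, allowed deviation {tolerance:.0%}"
        )

def check_null_spike(cur, table: str, column: str, max_null_ratio: float = 0.02) -> None:
    """Fail if the share of NULLs in a key column exceeds the threshold."""
    cur.execute(
        f"SELECT COUNT_IF({column} IS NULL) / NULLIF(COUNT(*), 0) FROM {table}"
    )
    null_ratio = cur.fetchone()[0] or 0
    if null_ratio > max_null_ratio:
        raise ValueError(
            f"{table}.{column}: {null_ratio:.2%} NULLs exceeds {max_null_ratio:.0%}"
        )
```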
Constraints
- Must use AWS and Snowflake; no self-managed Hadoop cluster
- Incremental cloud budget increase capped at $18K/month
- Finance reports are SOX-relevant, so auditability and reproducibility are required
- Source PostgreSQL cannot tolerate heavy read load during business hours
- Third-party APIs have rate limits: Stripe 100 req/s, Shippo 60 req/s
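Given the last constraint, the extractor needs client-side throttling so API pulls stay under the published limits even when pages are fetched concurrently. A minimal sketch of a per-API token bucket follows; the configured rates sit below the caps to leave headroom, and the request helper is illustrative rather than tied to the Stripe or Shippo SDKs.

```python
# Throttling sketch: a token bucket per API keeps extraction under the
# published rate limits (Stripe 100 req/s, Shippo 60 req/s). Rates and
# endpoint handling are illustrative.
import threading
import time

import requests

class TokenBucket:
    """Allow at most `rate` requests per second, smoothing bursts across threads."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

# Stay comfortably below the published caps to leave headroom for other callers.
STRIPE_BUCKET = TokenBucket(rate=80)   # published limit: 100 req/s
SHIPPO_BUCKET = TokenBucket(rate=45)   # published limit: 60 req/s

def fetch_page(session: requests.Session, url: str, bucket: TokenBucket, params: dict) -> dict:
    """Throttled GET for one page of results; the pagination loop lives in the caller."""
    bucket.acquire()
    resp = session.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
```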