Context
Northstar Retail generates daily finance and operations reports from transactional data in PostgreSQL, Stripe, and Salesforce. The current Airflow batch pipeline loads data into Snowflake and runs dbt models, but report consumers regularly find missing rows, duplicate transactions, and metric drift after source schema changes.
You need to redesign the reporting pipeline so recurring reports are published only after data integrity checks pass, while keeping the system simple enough for a small data team to operate.
Scale Requirements
- Sources: PostgreSQL OLTP (~120M orders total), Stripe payouts/refunds (~8M records), Salesforce account updates (~15M records)
- Daily ingest volume: ~25M changed rows/day, ~180 GB raw/day
- Reporting cadence: 3 reports per day, at 06:00, 12:00, and 18:00 UTC
- Latency target: Source extract to validated report tables in < 45 minutes
- Historical retention: 3 years in Snowflake, 90 days of raw staged files in S3
- Data quality SLA: < 0.1% unexplained row-count variance vs source systems
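The row-count SLA above can be expressed as a small reconciliation helper. This is a minimal sketch; the function name and the default threshold parameter are illustrative assumptions, not part of the existing stack:

```python
def within_row_count_sla(source_rows: int, target_rows: int,
                         max_variance: float = 0.001) -> bool:
    """Return True when the unexplained row-count variance between a
    source system and its Snowflake target is within the < 0.1% SLA.

    max_variance defaults to 0.001 (0.1%); tune per source if needed.
    """
    if source_rows == 0:
        # An empty source reconciles only against an empty target.
        return target_rows == 0
    variance = abs(source_rows - target_rows) / source_rows
    return variance <= max_variance
```

In practice the two counts would come from a source-side query (PostgreSQL, Stripe, Salesforce export metadata) and a matching Snowflake query over the same extract window.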
Requirements
- Design a batch ETL/ELT pipeline that incrementally ingests source changes into Snowflake.
- Implement automated integrity validation for every run, including row-count reconciliation, primary-key uniqueness, null checks on critical fields, freshness checks, and aggregate balancing for revenue metrics.
- Prevent downstream report publication when validation fails, and provide a quarantine/error workflow for bad batches.
- Support backfills for a date range without creating duplicates or corrupting published reports.
- Define orchestration, retry behavior, and idempotent loading semantics.
- Include monitoring, alerting, and auditability so analysts can trace which source snapshot produced each report.
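A minimal sketch of the validate-then-gate requirement, assuming a batch represented as a list of dicts. In the real pipeline these checks would run as dbt tests or SQL against Snowflake, with Airflow short-circuiting the publish task on failure; all names below are hypothetical:

```python
def check_pk_unique(rows, key):
    """Primary-key uniqueness: no duplicate key values in the batch."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_not_null(rows, fields):
    """Null check: every critical field is populated on every row."""
    return all(r.get(f) is not None for r in rows for f in fields)

def gate_batch(rows, key="order_id", critical=("order_id", "amount")):
    """Run integrity checks and return ('publish', []) on success, or
    ('quarantine', [failed check names]) so downstream report tasks can
    short-circuit instead of publishing a bad batch."""
    failures = []
    if not check_pk_unique(rows, key):
        failures.append("pk_uniqueness")
    if not check_not_null(rows, critical):
        failures.append("null_check")
    return ("quarantine", failures) if failures else ("publish", [])
```

The same pattern extends to row-count reconciliation, freshness, and aggregate balancing: each check contributes a named pass/fail result, and a single gate decides publish versus quarantine.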
Constraints
- Existing stack is AWS + Airflow 2.x + Snowflake + dbt; avoid introducing more than one major new platform.
- Team size: 3 data engineers, 2 analytics engineers.
- Finance reports are SOX-sensitive; every published dataset must be versioned and reproducible.
- Budget allows moderate warehouse scaling, but no always-on large compute clusters.
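For the SOX reproducibility constraint, each published dataset can carry a run manifest tying it to the exact source snapshots and model versions that produced it. The manifest schema and field names below are an illustrative assumption, not an existing artifact:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_run_manifest(run_id, source_snapshots, model_versions):
    """Assemble an audit manifest for one pipeline run.

    source_snapshots: e.g. staged S3 paths plus extract watermarks
    model_versions:   e.g. the dbt project git SHA used for the run
    The checksum covers only the reproducibility-relevant fields, so
    two runs over identical inputs yield the same checksum.
    """
    manifest = {
        "run_id": run_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "source_snapshots": source_snapshots,
        "model_versions": model_versions,
    }
    payload = json.dumps(
        {k: manifest[k] for k in ("run_id", "source_snapshots", "model_versions")},
        sort_keys=True,
    )
    manifest["checksum"] = hashlib.sha256(payload.encode()).hexdigest()
    return manifest
```

Storing this manifest alongside each published report table gives analysts the trace from report back to source snapshot that the requirements call for.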