Context
LedgerBox, a B2B payments platform, ingests transaction, refund, and settlement data from PostgreSQL, Stripe APIs, and internal event logs into Snowflake for finance and operations reporting. The current pipeline relies on nightly Airflow jobs and ad hoc SQL checks, which have led to duplicate loads, schema drift, and reconciliation gaps between source systems and warehouse tables.
You need to redesign the pipeline so data integrity is enforced end-to-end across ingestion, transformation, and serving layers, while supporting both daily batch loads and targeted backfills.
Scale Requirements
- Sources: 3 PostgreSQL OLTP databases, Stripe REST API, S3 event exports
- Volume: 120M transaction-related records/day, ~1.5 TB raw/day
- Peak batch window: 8M rows/hour during settlement close
- Latency target: finance tables available within 15 minutes of source extract completion
- Retention: 7 years for curated finance data, 180 days for raw landing data
- Data quality SLA: critical tables must maintain 99.99% row-level accuracy
Requirements
- Design an ETL/ELT pipeline that guarantees no missing, duplicated, or partially applied records in curated tables.
- Support incremental loads using CDC or high-watermark extraction, with deterministic reprocessing for backfills (see the extraction sketch after this list).
- Implement integrity controls including schema validation, null/uniqueness checks, referential integrity, and source-to-target reconciliation (see the validation sketch below).
- Ensure transformations are idempotent so retries do not create duplicate facts or inconsistent dimensions (see the MERGE sketch below).
- Orchestrate dependencies across raw ingestion, staging validation, dbt transformations, and publish steps (see the orchestration sketch below).
- Provide a strategy for handling late-arriving records, source corrections, and schema evolution.
- Define monitoring, alerting, and recovery procedures for failed loads and data quality breaches.
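The sketches that follow are illustrative only; schema, table, and column names (payments.transactions, staging.*, curated.fct_transactions) are assumptions, not part of the requirements. The first shows one way to satisfy the incremental-load requirement: a high-watermark extract over a closed interval, stamped with a batch_id and source extract timestamp so reruns and backfills are deterministic and every load stays traceable.

```python
# Deterministic high-watermark extraction (sketch; schema and column names are
# hypothetical). Extracting a closed interval (low, high] means a rerun or
# backfill with the same bounds always yields the same rows.
# `conn` is an open DB-API connection to the source Postgres (e.g. psycopg2).
import uuid
from datetime import datetime, timezone

EXTRACT_SQL = """
    SELECT txn_id, amount_cents, currency, status, updated_at
    FROM payments.transactions
    WHERE updated_at >  %(low_watermark)s
      AND updated_at <= %(high_watermark)s
    ORDER BY updated_at, txn_id
"""

def extract_batch(conn, low_watermark, high_watermark):
    """Pull one bounded slice and tag it with audit metadata (batch_id, extract ts)."""
    batch_id = str(uuid.uuid4())
    extracted_at = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        cur.execute(EXTRACT_SQL, {"low_watermark": low_watermark,
                                  "high_watermark": high_watermark})
        rows = cur.fetchall()
    manifest = {
        "batch_id": batch_id,
        "source": "payments.transactions",
        "low_watermark": low_watermark.isoformat(),
        "high_watermark": high_watermark.isoformat(),
        "row_count": len(rows),
        "source_extract_ts": extracted_at.isoformat(),
    }
    return rows, manifest  # rows land in S3; the manifest drives reconciliation later
```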
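A possible shape for the integrity controls, assuming the staged rows carry the batch_id from the extract manifest above and `conn` is a Snowflake Python connector session:

```python
# Staging validation gate (sketch; Snowflake table names are hypothetical).
# Each check returns a violation count; the batch is blocked if any check
# fails or if the staged row count disagrees with the extract manifest.
CHECKS = {
    "null_keys": (
        "SELECT COUNT(*) FROM staging.transactions "
        "WHERE batch_id = %(batch_id)s AND txn_id IS NULL"
    ),
    "duplicate_keys": (
        "SELECT COUNT(*) FROM (SELECT txn_id FROM staging.transactions "
        "WHERE batch_id = %(batch_id)s GROUP BY txn_id HAVING COUNT(*) > 1) AS dup"
    ),
    "orphan_refunds": (
        # Simplified referential check: a refund must point at a transaction that
        # is either already curated or arriving in the same batch.
        "SELECT COUNT(*) FROM staging.refunds r "
        "LEFT JOIN (SELECT txn_id FROM staging.transactions "
        "           UNION SELECT txn_id FROM curated.fct_transactions) t "
        "ON r.txn_id = t.txn_id "
        "WHERE r.batch_id = %(batch_id)s AND t.txn_id IS NULL"
    ),
}

def validate_batch(conn, batch_id, manifest_row_count):
    """Raise if the batch fails reconciliation or any integrity check."""
    failures = {}
    with conn.cursor() as cur:
        # Source-to-target reconciliation: staged rows must match the manifest.
        cur.execute(
            "SELECT COUNT(*) FROM staging.transactions WHERE batch_id = %(batch_id)s",
            {"batch_id": batch_id},
        )
        staged = cur.fetchone()[0]
        if staged != manifest_row_count:
            failures["row_count"] = f"staged {staged}, manifest {manifest_row_count}"
        # Null, uniqueness, and referential-integrity checks.
        for name, sql in CHECKS.items():
            cur.execute(sql, {"batch_id": batch_id})
            violations = cur.fetchone()[0]
            if violations:
                failures[name] = violations
    if failures:
        raise ValueError(f"batch {batch_id} failed validation: {failures}")
```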
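One way to make the load into curated tables idempotent and absorb late-arriving corrections: a Snowflake MERGE keyed on the natural key that only updates when the source row is newer, so retrying a batch cannot duplicate facts.

```python
# Idempotent publish into the curated fact table (sketch; names hypothetical).
# MERGE is keyed on the natural key and only updates when the source row is
# newer, so a retried batch or a late-arriving correction never duplicates facts.
MERGE_SQL = """
MERGE INTO curated.fct_transactions AS tgt
USING (
    SELECT txn_id, amount_cents, currency, status, updated_at, batch_id
    FROM staging.transactions
    WHERE batch_id = %(batch_id)s
) AS src
ON tgt.txn_id = src.txn_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
    amount_cents = src.amount_cents,
    currency     = src.currency,
    status       = src.status,
    updated_at   = src.updated_at,
    batch_id     = src.batch_id
WHEN NOT MATCHED THEN INSERT
    (txn_id, amount_cents, currency, status, updated_at, batch_id)
VALUES
    (src.txn_id, src.amount_cents, src.currency, src.status,
     src.updated_at, src.batch_id)
"""

def merge_batch(conn, batch_id):
    """Apply one validated batch; safe to rerun with the same batch_id."""
    with conn.cursor() as cur:
        cur.execute(MERGE_SQL, {"batch_id": batch_id})
```

Because the MATCHED branch compares updated_at, replaying an old batch is effectively a no-op, which is what makes backfills safe to rerun.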
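And a sketch of the orchestration layer, assuming Airflow 2.4+; the DAG id and dbt selector are hypothetical, the task callables are stubs standing in for the steps above, and the failure callback is where alerting and recovery runbooks would hook in.

```python
# Orchestration and alerting (sketch; assumes Airflow 2.4+). Publish only runs
# if ingestion, validation, and dbt all succeed, so a failed or partial load
# never reaches the curated layer.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract_to_landing():
    """Stub: run the watermark extract and land files plus manifest in S3."""

def validate_staging():
    """Stub: run reconciliation and integrity checks against staging."""

def merge_curated():
    """Stub: run the idempotent MERGE into curated tables."""

def alert_on_call(context):
    """Stub alert hook: replace with a Slack/PagerDuty notification."""
    print(f"Task {context['task_instance'].task_id} failed for {context['ds']}")

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": alert_on_call,
}

with DAG(
    dag_id="ledgerbox_finance_daily",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_to_landing", python_callable=extract_to_landing)
    validate = PythonOperator(task_id="validate_staging", python_callable=validate_staging)
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run --select finance --fail-fast")
    publish = PythonOperator(task_id="publish_curated", python_callable=merge_curated)

    # Linear dependency chain: each step gates the next.
    extract >> validate >> transform >> publish
```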
Constraints
- Stack must remain AWS- and Snowflake-centric
- Team size is 5 data engineers; operational complexity should be moderate
- Finance data is SOX-auditable; every load must be traceable by batch_id and source extract timestamp
- Budget allows managed services, but not a large always-on Spark cluster
- Downstream BI dashboards must never read partially validated tables (see the publish-gate sketch below)
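On the last constraint, one common pattern is to have transformations build into a shadow table and swap it into the BI-facing name only after validation passes; a minimal sketch, again with hypothetical table names and a Snowflake connector session as `conn`:

```python
# Publish gate (sketch; table names hypothetical). dbt writes into a shadow
# "build" table; the BI-facing table is swapped in atomically only after the
# gate check passes, so dashboards never see a partially validated state.
def publish_if_valid(conn, batch_id, expected_rows):
    """Swap the build table into place, or raise and leave the old table live."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) FROM curated.fct_transactions_build "
            "WHERE batch_id = %(batch_id)s",
            {"batch_id": batch_id},
        )
        if cur.fetchone()[0] != expected_rows:
            raise ValueError(f"batch {batch_id}: build table failed the publish gate")
        # Snowflake swaps the two tables in a single atomic statement; BI queries
        # see either the old table or the fully validated new one, never a mix.
        cur.execute(
            "ALTER TABLE curated.fct_transactions "
            "SWAP WITH curated.fct_transactions_build"
        )
```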