Context
LedgerLoop, a B2B SaaS billing platform, loads customer, invoice, payment, and subscription data from PostgreSQL, Stripe, and Salesforce into Snowflake for finance and product reporting. The current pipeline uses nightly Airflow jobs and ad hoc SQL checks, but stakeholders regularly find duplicate invoices, null customer keys, and mismatched revenue totals.
You need to redesign the pipeline so data quality and integrity are enforced systematically across ingestion, transformation, and warehouse publishing.
Scale Requirements
- Sources: PostgreSQL OLTP, Stripe API, Salesforce API, S3 CSV partner feeds
- Volume: ~120M rows/day across all sources
- Batch frequency: Hourly incremental loads, nightly full reconciliation
- Latency target: Source to trusted warehouse tables in < 15 minutes for hourly loads
- Storage: 25 TB historical raw data, 7-year finance retention
- Data quality SLA: Critical tables must maintain 99.9% valid records and zero duplicate primary business keys in curated layers
Requirements
- Design an ETL/ELT pipeline that validates schema, nullability, referential integrity, and duplicate business keys before publishing curated tables.
- Support idempotent re-runs for failed hourly jobs without creating duplicate records.
- Implement source-to-target reconciliation for row counts, invoice totals, and payment totals.
- Quarantine bad records for triage while allowing valid records to continue downstream.
- Add automated tests for raw, staging, and curated layers, including freshness and anomaly checks.
- Define monitoring, alerting, and recovery for failed loads, late-arriving data, and schema drift.
- Explain how you would backfill 6 months of invoice history safely.
Constraints
- Existing stack is AWS + Snowflake; avoid introducing more than one major new platform.
- Team size is 3 data engineers and 1 analytics engineer.
- Finance data is SOX-relevant, so all quality checks and manual overrides must be auditable.
- Budget allows moderate warehouse scaling, but compute-heavy full refreshes should be minimized.