Context
FinSight, a B2B fintech analytics company, ingests transaction, customer, and ledger data from PostgreSQL, Stripe, and S3 CSV drops into Snowflake for finance and risk reporting. The current Airflow-managed batch pipeline loads data nightly, but recurring issues with null keys, duplicate transactions, schema drift, and partial loads have caused reporting errors and broken downstream dashboards.
You need to design a data pipeline approach that ensures data quality across ingestion, transformation, and warehouse publishing while keeping the existing batch architecture and team skills largely unchanged.
Scale Requirements
- Sources: 3 core systems (PostgreSQL OLTP, Stripe API, S3 partner files)
- Volume: ~120M rows/day across all datasets
- Batch frequency: Hourly for transactions, daily for dimensions
- Latency target: Hourly transaction tables available in Snowflake within 20 minutes of batch start
- Storage: 8 TB raw retained for 180 days; curated warehouse tables retained indefinitely
- Data quality SLA: 99.5% of scheduled loads must pass quality gates without manual intervention
Requirements
- Design ingestion and transformation stages with explicit data quality checks at each layer.
- Validate schema, nullability, uniqueness, referential integrity, and freshness before publishing curated tables.
- Prevent bad data from reaching business-facing Snowflake marts.
- Support idempotent re-runs for failed hourly and daily jobs.
- Provide observability for failed checks, row-count anomalies, and delayed upstream feeds.
- Define how quarantined records are stored, reviewed, and replayed.
- Include orchestration logic for gating downstream tasks on data quality status.
Constraints
- Existing stack is AWS + Snowflake + Airflow; avoid introducing more than one major new platform.
- Team size is 3 data engineers; solution should be operable without a dedicated platform team.
- Finance data is SOX-audited, so all quality failures and overrides must be logged.
- Monthly incremental infrastructure budget is capped at $12K.
- Source systems cannot be heavily modified; most controls must be implemented downstream.