Context
NovaBank runs 1,200+ daily regression checks across batch ETL pipelines that publish finance, risk, and customer reporting datasets to Snowflake. The current validation process is fully sequential in Apache Airflow, reprocesses unchanged data, and takes 8 hours end-to-end, delaying releases and incident recovery.
You need to redesign the validation pipeline so the regression cycle drops to 4 hours or less without weakening data quality coverage. The platform already runs on AWS with Airflow, dbt, Spark, S3, and Snowflake, and the new design must fit that ecosystem.
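As a pointer toward the kind of change this implies, here is a minimal sketch of a parallel fan-out using Airflow dynamic task mapping (Airflow 2.4+); the DAG name, check registry, dataset names, and concurrency cap are illustrative placeholders rather than part of the existing platform:

```python
# Minimal sketch: fanning validation checks out in parallel with Airflow dynamic
# task mapping instead of running them sequentially. The check registry, dataset
# names, and concurrency cap below are illustrative placeholders.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def regression_validation():
    @task
    def list_checks() -> list[dict]:
        # In a real pipeline this would read a check registry (e.g. a Snowflake
        # control table) and return one entry per dataset/check pair due today.
        return [
            {"dataset": "finance.gl_balances", "check": "row_count_reconciliation"},
            {"dataset": "risk.exposures", "check": "schema_drift"},
        ]

    @task(max_active_tis_per_dag=64)  # cap fan-out to keep warehouse spend bounded
    def run_check(spec: dict) -> dict:
        # Placeholder for the actual validation call (dbt test, SQL assertion, etc.).
        return {**spec, "status": "pass"}

    # One mapped task instance per check; Airflow schedules them concurrently.
    run_check.expand(spec=list_checks())

regression_validation()
```

Capping mapped-task concurrency is one way to collapse the critical path while keeping Snowflake warehouse spend inside the budget constraint below.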
Scale Requirements
- Pipelines validated: 180 critical DAGs per day
- Validation suites: ~1,200 checks/run across schema, row-count, reconciliation, freshness, and business-rule tests
- Data volume: 35 TB/day raw, 9 TB/day curated
- Largest table: 14B rows, 4.5 TB compressed
- Latency target: Full regression validation in <= 4 hours
- Freshness SLA: Critical datasets validated within 30 minutes of load completion
- Retention: Validation artifacts and metrics stored for 180 days
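A back-of-envelope reading of these figures, assuming check durations are roughly uniform (an assumption, not a measured profile):

```python
# Back-of-envelope sizing from the figures above. Assumes the 8-hour sequential
# run is dominated by the ~1,200 checks and that check durations are roughly
# uniform; both are simplifying assumptions, not measured profiles.
checks_per_run = 1200
sequential_hours = 8
avg_check_seconds = sequential_hours * 3600 / checks_per_run   # ~24 s per check

target_hours = 4
# Minimum speed-up if every check still runs: 2x. Incremental skipping shrinks
# the check count and lowers this, while dependency chains and warehouse
# queueing push the required parallelism higher.
min_speedup = sequential_hours / target_hours
print(f"{avg_check_seconds:.0f} s/check, >= {min_speedup:.0f}x speed-up needed")
```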
Requirements
- Design a validation architecture that parallelizes checks across datasets and environments while preserving deterministic results.
- Support incremental validation so unchanged partitions/models are skipped safely (see the partition-fingerprint sketch after this list).
- Implement data quality gates for schema drift, null spikes, duplicate keys, referential integrity, and source-to-target reconciliation (see the SQL assertion sketch after this list).
- Provide idempotent reruns for failed validation stages without revalidating successful assets.
- Store validation outcomes, lineage, and historical baselines for trend analysis and release approval.
- Define orchestration, alerting, rollback criteria, and how batch backfills are validated differently from daily runs.
- Explain how you would prevent false positives from late-arriving data and expected seasonal volume changes (see the tolerance-band sketch after this list).
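For the incremental-validation requirement, one possible shape of the partition-fingerprint skip logic; the `_loaded_at` column, the `as_of_date` partitioning scheme, and the in-memory state dict are assumptions for illustration (a real design would persist state in Snowflake or S3):

```python
# Sketch of incremental skipping: a partition is revalidated only when its
# fingerprint changes. The fingerprint source (row count + max load timestamp),
# the _loaded_at column, and the in-memory state dict are illustrative choices.
import hashlib

def partition_fingerprint(cur, table: str, partition_date: str) -> str:
    cur.execute(
        f"SELECT COUNT(*), MAX(_loaded_at) FROM {table} WHERE as_of_date = %s",
        (partition_date,),
    )
    row_count, max_loaded_at = cur.fetchone()
    return hashlib.sha256(f"{row_count}|{max_loaded_at}".encode()).hexdigest()

def needs_validation(cur, table: str, partition_date: str, state: dict[str, str]) -> bool:
    key = f"{table}:{partition_date}"
    fingerprint = partition_fingerprint(cur, table, partition_date)
    if state.get(key) == fingerprint:
        return False   # unchanged since the last successful validation: skip safely
    state[key] = fingerprint   # in practice, persist only after the checks pass
    return True
```

Persisting the fingerprint only after checks pass also gives the idempotent-rerun behaviour asked for above: a rerun re-executes exactly the assets whose fingerprints were never committed.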
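To make the quality-gate requirement concrete, a hedged sketch of duplicate-key and null-spike gates expressed as SQL assertions run from Python against Snowflake; the table and column names, the 1% null threshold, and the connection handling are illustrative, and the same checks could equally be implemented as dbt tests:

```python
# Illustrative quality-gate checks expressed as SQL assertions against Snowflake.
# Table/column names and the 1% null-rate threshold are placeholders; credential
# handling is omitted.
import snowflake.connector

# conn = snowflake.connector.connect(account="...", user="...")  # setup omitted

CHECKS = {
    # Gate fails if any (account_id, as_of_date) key appears more than once.
    "duplicate_keys": """
        SELECT COUNT(*) FROM (
            SELECT account_id, as_of_date
            FROM finance.gl_balances
            GROUP BY account_id, as_of_date
            HAVING COUNT(*) > 1
        )
    """,
    # Null spike: returns 1 if today's null rate exceeds 1%. A fuller version
    # would compare against the stored historical baseline instead of a constant.
    "null_spike_balance": """
        SELECT CASE WHEN AVG(IFF(balance IS NULL, 1, 0)) > 0.01 THEN 1 ELSE 0 END
        FROM finance.gl_balances
        WHERE as_of_date = CURRENT_DATE
    """,
}

def run_gates(conn) -> dict[str, bool]:
    # Returns {check_name: passed}. Each check's SQL yields a single number that
    # must be zero for the gate to pass.
    cur = conn.cursor()
    results = {}
    for name, sql in CHECKS.items():
        cur.execute(sql)
        (violations,) = cur.fetchone()
        results[name] = violations == 0
    return results
```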
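For the false-positive requirement, a tolerance-band sketch that compares today's volume with the same weekday over recent weeks, so month-end or Monday spikes do not trip volume alerts; the 3-sigma band and minimum-history cutoff are illustrative choices. Late-arriving data would additionally need a grace window before a freshness miss becomes a hard gate failure.

```python
# Sketch of a seasonal tolerance band: today's row count is compared with the
# same weekday over recent weeks rather than a fixed threshold. The 3-sigma band
# and the minimum-history cutoff are illustrative choices.
from statistics import mean, stdev

def volume_within_band(today_count: int, same_weekday_history: list[int],
                       sigmas: float = 3.0) -> bool:
    if len(same_weekday_history) < 4:
        return True  # not enough history yet: don't fail the gate on volume alone
    mu, sd = mean(same_weekday_history), stdev(same_weekday_history)
    return abs(today_count - mu) <= sigmas * max(sd, 1.0)
```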
Constraints
- Must remain primarily on AWS + Snowflake; no large platform migration
- Incremental cloud spend capped at $30K/month
- SOX-sensitive finance datasets require auditable validation logs and role-based access (see the audit-log sketch after this list)
- Team size is 5 data engineers; operational complexity should stay manageable
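For the SOX constraint, a hedged sketch of an append-only validation audit log with role-based grants in Snowflake; the database, schema, table, and role names are placeholders:

```python
# Illustrative DDL and grants for an append-only validation audit log in
# Snowflake. Database, schema, table, and role names are placeholders.
AUDIT_DDL = """
CREATE TABLE IF NOT EXISTS dq.audit.validation_results (
    run_id       STRING,
    dataset      STRING,
    check_name   STRING,
    status       STRING,          -- pass / fail / skipped
    observed     VARIANT,         -- metrics captured by the check
    executed_by  STRING,
    executed_at  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
"""

GRANTS = [
    # The validation service role can only append; withholding UPDATE/DELETE
    # keeps the log effectively immutable.
    "GRANT INSERT, SELECT ON TABLE dq.audit.validation_results TO ROLE dq_validator",
    # Auditors and release approvers get read-only access.
    "GRANT SELECT ON TABLE dq.audit.validation_results TO ROLE sox_auditor",
]

def bootstrap_audit_log(cur) -> None:
    cur.execute(AUDIT_DDL)
    for stmt in GRANTS:
        cur.execute(stmt)
```

Granting the validation role INSERT and SELECT but not UPDATE or DELETE is what makes the log usable as audit evidence for release approval.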