Context
FinEdge, a B2B payments company, runs hourly and nightly ETL pipelines that ingest transaction, customer, and ledger data from PostgreSQL, S3, and third-party APIs into Snowflake. The current stack uses Apache Airflow for orchestration, dbt for transformations, and Python-based ingestion jobs. Recent production incidents, however, have exposed weak automated test coverage across schema changes, data quality checks, and DAG deployments.
The data platform team wants a testing strategy that catches failures before they reach production while keeping deployment velocity high. Your task is to design an automated testing approach for this stack and explain which tools you would use, where each test should run, and how failures should be surfaced.
Scale Requirements
- Sources: 12 internal tables, 4 external APIs, 3 S3 batch feeds
- Volume: ~250M rows/day, ~1.2 TB/day raw data
- Pipeline frequency: 8 hourly DAGs, 3 daily DAGs
- Latency target: output of hourly pipelines available in Snowflake within 15 minutes
- Change rate: 20-30 dbt model changes and 5-10 DAG/code changes per week
- Reliability target: 99.9% successful scheduled runs
Requirements
- Design an automated testing strategy for Python ETL code, Airflow DAGs, and dbt models (see the pytest sketch after this list).
- Include unit, integration, and data quality tests, and specify where each test runs in CI/CD.
- Validate schema compatibility, null handling, duplicate detection, and referential integrity before production deployment (see the quality-gate sketch after this list).
- Ensure test failures block promotion to production and provide actionable error output.
- Support safe backfills and idempotent reruns after failed loads (see the idempotent-load sketch after this list).
- Describe how you would test both batch ingestion jobs and downstream warehouse transformations.
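For the Python and Airflow layers, one workable baseline is a pytest suite with two tiers: fast unit tests for pure transformation functions, plus a DAG integrity test that loads every DAG file through Airflow's DagBag. A minimal sketch; the finedge_etl package, normalize_amounts function, and dags/ path are hypothetical placeholders, not FinEdge's actual layout. Both tiers can run in GitHub Actions on every pull request, before anything is deployed.

```python
from airflow.models import DagBag

from finedge_etl.transforms import normalize_amounts  # hypothetical module


def test_normalize_amounts_drops_null_amounts():
    # Unit test for a pure transformation: rows with null amounts should be
    # filtered out, not coerced to zero, before loading into Snowflake.
    rows = [
        {"txn_id": "a1", "amount": "12.50"},
        {"txn_id": "a2", "amount": None},
    ]
    assert [r["txn_id"] for r in normalize_amounts(rows)] == ["a1"]


def test_dags_import_cleanly():
    # DAG integrity test: every file under dags/ must parse without import
    # errors, so broken DAGs are caught in CI instead of in the scheduler.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, dag_bag.import_errors
```

dbt models get the equivalent treatment from `dbt build` against a disposable CI schema, so compilation errors and test failures surface in the same pull-request check.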
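The pre-deployment validations in the third requirement reduce to assertions that must return zero offending rows. Null handling, duplicate detection, and referential integrity all fit a single pattern, sketched below against hypothetical staging tables (STG.TRANSACTIONS, STG.CUSTOMERS) using any DB-API cursor such as the Snowflake connector's; schema compatibility is better handled separately, for example with dbt model contracts or a diff against INFORMATION_SCHEMA.

```python
# Hedged sketch of pre-promotion data-quality gates run against a staging
# schema in Snowflake. Table and column names are illustrative.

CHECKS = {
    "null_keys": """
        SELECT COUNT(*) FROM STG.TRANSACTIONS WHERE txn_id IS NULL
    """,
    "duplicate_keys": """
        SELECT COUNT(*) FROM (
            SELECT txn_id FROM STG.TRANSACTIONS
            GROUP BY txn_id HAVING COUNT(*) > 1
        )
    """,
    "orphaned_transactions": """
        SELECT COUNT(*) FROM STG.TRANSACTIONS t
        LEFT JOIN STG.CUSTOMERS c ON t.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
}


def run_quality_gates(cursor) -> list[str]:
    # Each check must return a zero count; any non-zero result is a failure
    # that should block promotion and name the failing check in the output.
    failures = []
    for name, sql in CHECKS.items():
        cursor.execute(sql)
        count = cursor.fetchone()[0]
        if count:
            failures.append(f"{name}: {count} offending rows")
    return failures
```

In dbt, the same three checks map directly onto the built-in not_null, unique, and relationships tests declared in model YAML, which keeps them versioned alongside the models they protect.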
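For backfills and reruns, one common pattern is to key each load on the run's logical hour and replace the whole partition inside a transaction. A sketch under assumed table names (RAW.TRANSACTIONS, STG.TRANSACTIONS_INCOMING); the cursor is any DB-API cursor, e.g. the Snowflake connector's, which accepts %s-style bind parameters.

```python
def load_partition(cursor, logical_hour: str) -> None:
    # Delete-then-insert inside one explicit transaction: rerunning the same
    # hour (or backfilling an old one) replaces the partition rather than
    # appending duplicate rows.
    cursor.execute("BEGIN")
    try:
        cursor.execute(
            "DELETE FROM RAW.TRANSACTIONS WHERE load_hour = %s",
            (logical_hour,),
        )
        cursor.execute(
            """
            INSERT INTO RAW.TRANSACTIONS
            SELECT src.*, %s AS load_hour
            FROM STG.TRANSACTIONS_INCOMING src
            """,
            (logical_hour,),
        )
        cursor.execute("COMMIT")
    except Exception:
        cursor.execute("ROLLBACK")
        raise
```

Because a rerun is a replacement rather than an append, Airflow task retries and manual backfills (e.g. `airflow dags backfill`) become safe by construction.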
Constraints
- Infrastructure is AWS-based with GitHub Actions already in place
- Team size is 3 data engineers and 1 analytics engineer
- Budget favors managed services already in use over adding large new platforms
- SOX-related financial datasets require auditability of test results and deployments (a report-archiving sketch follows)
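For the SOX constraint, the key property is that every CI run leaves an immutable, reviewable record of which tests ran and what they returned. A small sketch of one approach; pytest's built-in --junitxml flag produces the machine-readable report, while the tests/ and artifacts/ paths are assumptions.

```python
import subprocess
import sys

# Run the suite and write a JUnit XML report that CI can archive.
proc = subprocess.run(
    [sys.executable, "-m", "pytest", "tests/",
     "--junitxml=artifacts/test-report.xml"],
)
# Propagate the exit code: a failing suite fails the CI job and blocks
# promotion, while the XML report is still written for the audit trail.
sys.exit(proc.returncode)
```

GitHub Actions can then retain artifacts/test-report.xml with the workflow run (for example via actions/upload-artifact), tying test evidence to the exact commit that was deployed.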