Context
FinSight, a B2B analytics company, runs daily and hourly ETL pipelines that ingest CRM, billing, and product-usage data into Snowflake. The team currently shares a fragile staging environment where test data is inconsistent, upstream dependencies are refreshed manually, and pipeline validation often fails because of the environment rather than because of code defects.
You need to design a reliable testing environment for data pipelines so engineers can validate schema changes, transformation logic, orchestration behavior, and data quality checks before production deployment.
Scale Requirements
- Pipelines: 120 Airflow DAGs, comprising 80 batch ETL jobs and 40 ELT/dbt workflows
- Data volume: 2 TB/day in production; test environment should support representative subsets of 50-100 GB/day
- Freshness: Test environment should be refreshable within 30 minutes
- Concurrency: 20 engineers running parallel test workflows
- Recovery target: A failed environment setup should be recoverable within 15 minutes
- Retention: Keep test datasets and logs for 14 days
Requirements
- Design an isolated test environment for Airflow, Spark, dbt, and Snowflake workloads.
- Ensure test data is deterministic, masked, production-like, and safe for repeated runs.
- Support automated environment readiness checks before pipeline execution (see the readiness-check sketch after this list).
- Validate orchestration dependencies, idempotent reruns, and backfill behavior (see the rerun test sketch after this list).
- Include data quality gates for schema drift, null spikes, duplicate records, and row-count anomalies (see the quality-gate sketch after this list).
- Provide CI/CD integration so pull requests can trigger pipeline tests automatically.
- Define monitoring, alerting, and recovery procedures for environment failures.
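To make the readiness requirement concrete, the script below is a minimal pre-flight sketch that could run before any test DAG is triggered. The database, warehouse, schema names (FINSIGHT_TEST, TEST_WH, STG_CRM, STG_BILLING, STG_USAGE) and the credential environment variables are illustrative assumptions, not part of the brief.

```python
"""Pre-flight readiness check for the test environment (sketch, assumed names)."""
import os
import sys

import snowflake.connector

# Staging schemas the test DAGs read from (hypothetical names).
REQUIRED_SCHEMAS = ["STG_CRM", "STG_BILLING", "STG_USAGE"]


def main() -> int:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="FINSIGHT_TEST",   # isolated test database (assumed name)
        warehouse="TEST_WH",        # small dedicated test warehouse (assumed name)
    )
    try:
        cur = conn.cursor()
        # Any trivial query proves the warehouse can resume and serve queries.
        cur.execute("SELECT CURRENT_WAREHOUSE()")
        # Every staging schema the DAGs depend on must exist before tests start.
        cur.execute("SHOW SCHEMAS IN DATABASE FINSIGHT_TEST")
        present = {row[1] for row in cur.fetchall()}  # column 1 is the schema name
        missing = [s for s in REQUIRED_SCHEMAS if s not in present]
        if missing:
            print(f"NOT READY: missing schemas {missing}")
            return 1
        print("READY")
        return 0
    finally:
        conn.close()


if __name__ == "__main__":
    sys.exit(main())
```

A wrapper like this can be the first task of every test DAG or a gate step in CI, so pipeline runs fail fast with an environment error instead of a misleading transformation error.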
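For the rerun and backfill requirement, pytest checks along the following lines could be invoked from the PR pipeline against the test environment. This is a sketch only: pipelines.billing.run_billing_load is a hypothetical wrapper that invokes the task's callable for one logical date, and FACT_BILLING / LOAD_DATE are illustrative table and partition-column names.

```python
"""Idempotency and backfill checks (sketch, hypothetical helpers and names)."""
import os

import pytest
import snowflake.connector

from pipelines.billing import run_billing_load  # hypothetical task wrapper


@pytest.fixture(scope="module")
def conn():
    # Same credentials and test database as the readiness check above.
    c = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="FINSIGHT_TEST",
    )
    yield c
    c.close()


def row_count(conn, table: str, ds: str) -> int:
    """Rows loaded for one logical date (assumes a LOAD_DATE partition column)."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE LOAD_DATE = %s", (ds,))
    return cur.fetchone()[0]


def test_rerun_is_idempotent(conn):
    ds = "2024-01-15"                            # arbitrary logical date
    run_billing_load(ds)                         # first run loads the partition
    first = row_count(conn, "FACT_BILLING", ds)
    run_billing_load(ds)                         # rerun simulates a retry or manual clear
    assert row_count(conn, "FACT_BILLING", ds) == first


def test_backfill_replay_is_repeatable(conn):
    window = ["2024-01-13", "2024-01-14", "2024-01-15"]
    for ds in window:                            # initial backfill of the window
        run_billing_load(ds)
    before = sum(row_count(conn, "FACT_BILLING", ds) for ds in window)
    for ds in reversed(window):                  # replay the window in a different order
        run_billing_load(ds)
    after = sum(row_count(conn, "FACT_BILLING", ds) for ds in window)
    assert before == after
```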
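The data quality gates could start as plain SQL assertions run through the Snowflake connector before a test run is marked green. In the sketch below, the table name, business key, expected column set, and the ±20% row-count tolerance are illustrative assumptions that would normally live in per-table configuration.

```python
"""Post-run data quality gate (sketch, thresholds and names are assumptions)."""


def quality_gate(cur, table: str, key: str, expected_columns: set[str],
                 expected_rows: int) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passed."""
    failures = []

    # Schema drift: live columns must match the committed contract.
    cur.execute(
        "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
        f"WHERE TABLE_NAME = '{table}'"
    )
    live = {row[0] for row in cur.fetchall()}
    if live != expected_columns:
        failures.append(f"{table}: schema drift, diff={live ^ expected_columns}")

    # Null spike: the business key must never be null.
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL")
    if cur.fetchone()[0] > 0:
        failures.append(f"{table}: null {key} values found")

    # Duplicates: the business key must be unique.
    cur.execute(f"SELECT COUNT(*) - COUNT(DISTINCT {key}) FROM {table}")
    if cur.fetchone()[0] > 0:
        failures.append(f"{table}: duplicate {key} values found")

    # Row-count anomaly: within +/-20% of the expected subset size.
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    actual = cur.fetchone()[0]
    if not (0.8 * expected_rows <= actual <= 1.2 * expected_rows):
        failures.append(f"{table}: row count {actual} outside expected range")

    return failures
```

A thin wrapper can run this gate for every table a DAG writes and fail the final Airflow task (or the CI job) if any failure messages come back, which keeps the gate visible in both orchestration and pull-request checks.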
Constraints
- Infrastructure must stay on AWS and use the existing Airflow 2.x, EMR Spark, dbt Core, and Snowflake stack.
- Monthly incremental budget is capped at $12K.
- Production PII cannot be copied directly; all test data must be masked or synthetically generated (see the masking sketch after this list).
- Small team: 3 data engineers and 1 platform engineer, so operational complexity must be kept low.
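One way to satisfy the masking constraint while keeping test data deterministic (the same production value always maps to the same masked value across refreshes, so joins and repeated runs stay stable) is keyed hashing of direct identifiers. The column names and the HMAC scheme below are assumptions, not prescribed by the brief; any scheme works as long as it is deterministic and not reversible without the secret key.

```python
"""Deterministic PII masking for test-data extraction (sketch, assumed columns)."""
import hashlib
import hmac
import os

MASK_KEY = os.environ["TEST_MASK_KEY"].encode()  # secret kept outside the repo


def mask(value: str, length: int = 12) -> str:
    """Stable, non-reversible token for a PII value."""
    digest = hmac.new(MASK_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_customer_row(row: dict) -> dict:
    """Replace direct identifiers; keep non-PII fields so joins and metrics stay realistic."""
    return {
        **row,
        "email": f"user_{mask(row['email'])}@example.test",
        "full_name": f"cust_{mask(row['full_name'])}",
        "phone": mask(row["phone"], 10),
    }
```

Running this transformation in the extraction job that builds the 50-100 GB/day subset keeps raw PII out of the test environment entirely while preserving referential integrity across CRM, billing, and usage tables.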