Context
FinSight, a B2B analytics company, runs daily and hourly ETL pipelines that ingest CRM, billing, and product-usage data into Snowflake. The team currently shares a fragile staging environment where test data is inconsistent, upstream dependencies are refreshed manually, and pipeline validation often fails because of the environment rather than because of code defects.
You need to design a reliable testing environment for data pipelines so engineers can validate schema changes, transformation logic, orchestration behavior, and data quality checks before production deployment.
Scale Requirements
- Pipelines: 120 Airflow DAGs, comprising 80 batch ETL jobs and 40 ELT/dbt workflows
- Data volume: 2 TB/day in production; test environment should support representative subsets of 50-100 GB/day
- Freshness: Test environment should be refreshable within 30 minutes
- Concurrency: 20 engineers running parallel test workflows
- Recovery target: A failed environment setup should be recoverable within 15 minutes
- Retention: Keep test datasets and logs for 14 days
Requirements
- Design an isolated test environment for Airflow, Spark, dbt, and Snowflake workloads.
- Ensure test data is deterministic, masked, production-like, and safe for repeated runs.
- Support automated environment readiness checks before pipeline execution (see the readiness-check sketch after this list).
- Validate orchestration dependencies, idempotent reruns, and backfill behavior (see the rerun test sketch after this list).
- Include data quality gates for schema drift, null spikes, duplicate records, and row-count anomalies (see the quality-gate sketch after this list).
- Provide CI/CD integration so pull requests can trigger pipeline tests automatically.
- Define monitoring, alerting, and recovery procedures for environment failures.
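To make the readiness requirement concrete, the script below is a minimal pre-flight sketch that could run before any test DAG is triggered. The database, warehouse, schema names (FINSIGHT_TEST, TEST_WH, STG_CRM, STG_BILLING, STG_USAGE) and the credential environment variables are illustrative assumptions, not part of the brief.

```python
"""Pre-flight readiness check for the test environment (sketch, assumed names)."""
import os
import sys

import snowflake.connector

# Staging schemas the test DAGs read from (hypothetical names).
REQUIRED_SCHEMAS = ["STG_CRM", "STG_BILLING", "STG_USAGE"]


def main() -> int:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="FINSIGHT_TEST",   # isolated test database (assumed name)
        warehouse="TEST_WH",        # small dedicated test warehouse (assumed name)
    )
    try:
        cur = conn.cursor()
        # Any trivial query proves the warehouse can resume and serve queries.
        cur.execute("SELECT CURRENT_WAREHOUSE()")
        # Every staging schema the DAGs depend on must exist before tests start.
        cur.execute("SHOW SCHEMAS IN DATABASE FINSIGHT_TEST")
        present = {row[1] for row in cur.fetchall()}  # column 1 is the schema name
        missing = [s for s in REQUIRED_SCHEMAS if s not in present]
        if missing:
            print(f"NOT READY: missing schemas {missing}")
            return 1
        print("READY")
        return 0
    finally:
        conn.close()


if __name__ == "__main__":
    sys.exit(main())
```

A wrapper like this can be the first task of every test DAG or a gate step in CI, so pipeline runs fail fast with an environment error instead of a misleading transformation error.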
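For the rerun and backfill requirement, pytest checks along the following lines could be invoked from the PR pipeline against the test environment. This is a sketch only: pipelines.billing.run_billing_load is a hypothetical wrapper that invokes the task's callable for one logical date, and FACT_BILLING / LOAD_DATE are illustrative table and partition-column names.

```python
"""Idempotency and backfill checks (sketch, hypothetical helpers and names)."""
import os

import pytest
import snowflake.connector

from pipelines.billing import run_billing_load  # hypothetical task wrapper


@pytest.fixture(scope="module")
def conn():
    # Same credentials and test database as the readiness check above.
    c = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="FINSIGHT_TEST",
    )
    yield c
    c.close()


def row_count(conn, table: str, ds: str) -> int:
    """Rows loaded for one logical date (assumes a LOAD_DATE partition column)."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE LOAD_DATE = %s", (ds,))
    return cur.fetchone()[0]


def test_rerun_is_idempotent(conn):
    ds = "2024-01-15"                            # arbitrary logical date
    run_billing_load(ds)                         # first run loads the partition
    first = row_count(conn, "FACT_BILLING", ds)
    run_billing_load(ds)                         # rerun simulates a retry or manual clear
    assert row_count(conn, "FACT_BILLING", ds) == first


def test_backfill_replay_is_repeatable(conn):
    window = ["2024-01-13", "2024-01-14", "2024-01-15"]
    for ds in window:                            # initial backfill of the window
        run_billing_load(ds)
    before = sum(row_count(conn, "FACT_BILLING", ds) for ds in window)
    for ds in reversed(window):                  # replay the window in a different order
        run_billing_load(ds)
    after = sum(row_count(conn, "FACT_BILLING", ds) for ds in window)
    assert before == after
```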
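The data quality gates could start as plain SQL assertions run through the Snowflake connector before a test run is marked green. In the sketch below, the table name, business key, expected column set, and the ±20% row-count tolerance are illustrative assumptions that would normally live in per-table configuration.

```python
"""Post-run data quality gate (sketch, thresholds and names are assumptions)."""


def quality_gate(cur, table: str, key: str, expected_columns: set[str],
                 expected_rows: int) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passed."""
    failures = []

    # Schema drift: live columns must match the committed contract.
    cur.execute(
        "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
        f"WHERE TABLE_NAME = '{table}'"
    )
    live = {row[0] for row in cur.fetchall()}
    if live != expected_columns:
        failures.append(f"{table}: schema drift, diff={live ^ expected_columns}")

    # Null spike: the business key must never be null.
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL")
    if cur.fetchone()[0] > 0:
        failures.append(f"{table}: null {key} values found")

    # Duplicates: the business key must be unique.
    cur.execute(f"SELECT COUNT(*) - COUNT(DISTINCT {key}) FROM {table}")
    if cur.fetchone()[0] > 0:
        failures.append(f"{table}: duplicate {key} values found")

    # Row-count anomaly: within +/-20% of the expected subset size.
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    actual = cur.fetchone()[0]
    if not (0.8 * expected_rows <= actual <= 1.2 * expected_rows):
        failures.append(f"{table}: row count {actual} outside expected range")

    return failures
```

A thin wrapper can run this gate for every table a DAG writes and fail the final Airflow task (or the CI job) if any failure messages come back, which keeps the gate visible in both orchestration and pull-request checks.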
Constraints
- Infrastructure must stay on AWS and use the existing Airflow 2.x, EMR Spark, dbt Core, and Snowflake stack.
- Monthly incremental budget is capped at $12K.
- Production PII cannot be copied directly; all test data must be masked or synthetically generated (see the masking sketch after this list).
- Small team: 3 data engineers and 1 platform engineer, so operational complexity must be kept low.
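One way to satisfy the masking constraint while keeping test data deterministic (the same production value always maps to the same masked value across refreshes, so joins and repeated runs stay stable) is keyed hashing of direct identifiers. The column names and the HMAC scheme below are assumptions, not prescribed by the brief; any scheme works as long as it is deterministic and not reversible without the secret key.

```python
"""Deterministic PII masking for test-data extraction (sketch, assumed columns)."""
import hashlib
import hmac
import os

MASK_KEY = os.environ["TEST_MASK_KEY"].encode()  # secret kept outside the repo


def mask(value: str, length: int = 12) -> str:
    """Stable, non-reversible token for a PII value."""
    digest = hmac.new(MASK_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_customer_row(row: dict) -> dict:
    """Replace direct identifiers; keep non-PII fields so joins and metrics stay realistic."""
    return {
        **row,
        "email": f"user_{mask(row['email'])}@example.test",
        "full_name": f"cust_{mask(row['full_name'])}",
        "phone": mask(row["phone"], 10),
    }
```

Running this transformation in the extraction job that builds the 50-100 GB/day subset keeps raw PII out of the test environment entirely while preserving referential integrity across CRM, billing, and usage tables.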