Context
FinSight, a B2B fintech analytics company, runs 120 Airflow-managed ETL/ELT pipelines that ingest data from PostgreSQL, S3, and Kafka into Snowflake via Spark and dbt. The current shared staging environment is unstable: test data drifts from production, pipelines interfere with one another, and engineers cannot reliably validate schema changes, backfills, or data quality rules before release.
You need to design a maintainable test-environment strategy for data pipelines that supports isolated development, repeatable integration testing, and production-like validation without exposing sensitive customer data.
Scale Requirements
- Pipelines: 120 scheduled DAGs, 25 critical hourly jobs, 8 streaming jobs
- Data volume: 6 TB/day in production; test environment should support representative subsets of 200-500 GB/day
- Concurrency: Up to 30 parallel CI test runs and 10 active developer sandboxes
- Latency: CI pipeline validation must complete in < 20 minutes; ephemeral environment provisioning in < 10 minutes
- Retention: Keep test artifacts and logs for 30 days; synthetic/masked datasets refreshed daily
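One way to carve the 200-500 GB/day subsets out of the 6 TB/day production volume is deterministic, entity-level sampling (a sketch; the 5% rate and the idea of keying on `customer_id` are assumptions, not part of the brief). Hashing a business key instead of sampling rows at random keeps referential integrity across tables and across daily refreshes:

```python
import hashlib

def in_sample(entity_id: str, rate: float = 0.05) -> bool:
    """Deterministic entity-level sampling for building test subsets.

    Hashing the business key (e.g. customer_id) rather than picking rows
    at random means: if a customer is in the subset, *all* of their rows
    are, in every source table, on every daily refresh. No sampling state
    needs to be stored anywhere.
    """
    # Map the key to a uniform value in [0, 1) via the first 8 bytes of SHA-256.
    h = int.from_bytes(hashlib.sha256(entity_id.encode()).digest()[:8], "big")
    return (h / 2**64) < rate
```

The same predicate can be pushed down as a SQL expression in the extraction queries, so all sources agree on membership without coordination.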
Requirements
- Design separate CI, shared integration, and developer sandbox environments for ETL and streaming pipelines.
- Ensure test datasets are production-like using masked snapshots, synthetic records, and deterministic seed data.
- Support automated validation for schema compatibility, idempotency, backfills, and data quality checks before deployment.
- Isolate orchestration state, compute, and storage so concurrent test runs do not corrupt each other.
- Define how Airflow, Spark, dbt, and Snowflake objects are namespaced and cleaned up.
- Include monitoring, alerting, and failure recovery for environment drift, stale test data, and broken test pipelines.
- Show how to test streaming jobs with replayable Kafka topics and deterministic offsets.
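For the replayable-topic requirement, one pattern is to freeze a fixture topic once, record its (partition, start, end) offsets alongside the test, and have every run consume exactly that window. A minimal sketch of the idea, assuming an injectable `fetch` callable so tests can run against an in-memory fixture instead of a live MSK broker (`ReplayWindow`, `replay_plan`, and `consume_window` are illustrative names, not an existing API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplayWindow:
    """A pinned offset range on one partition of a frozen fixture topic."""
    topic: str
    partition: int
    start: int  # first offset to consume (inclusive)
    end: int    # stop offset (exclusive) -- consumption halts here

def replay_plan(windows):
    """Deterministic, sorted seek plan: same fixture -> same messages, every run."""
    return sorted((w.topic, w.partition, w.start) for w in windows)

def consume_window(window, fetch):
    """Drain exactly [start, end).

    `fetch(partition, offset)` stands in for a real consumer poll (e.g. a
    confluent-kafka consumer after seeking), so the stopping condition is
    an offset comparison, never a wall-clock timeout.
    """
    return [fetch(window.partition, off)
            for off in range(window.start, window.end)]
```

Versioning the window file with the test code means an assertion failure can only come from a pipeline change, never from the topic having moved.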
Constraints
- AWS-first stack; existing tools are Airflow 2.x, Spark on EMR, dbt Core, Snowflake, Kafka (MSK), and Great Expectations
- Incremental budget cap: $12K/month for non-production infrastructure
- Compliance: PII cannot be copied directly; all lower environments must use masking or synthetic generation
- Small platform team: 3 data engineers and 1 DevOps engineer, so operational complexity must stay low
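The PII constraint can be met with deterministic pseudonymization during the daily refresh (a sketch under assumptions: the key would live in AWS Secrets Manager, and the list of PII columns per table is maintained by the platform team; both are placeholders here). Keyed hashing keeps masking stable across tables, so joins on masked columns still work in lower environments:

```python
import hashlib
import hmac

# Placeholder key: in practice fetched from AWS Secrets Manager and
# rotated together with each daily refresh of the masked dataset.
MASKING_KEY = b"example-only-rotate-me"

def mask_email(email: str) -> str:
    """Deterministically pseudonymize an email address.

    The same input always yields the same token, so users <-> transactions
    joins on email survive masking, while raw PII never leaves production.
    """
    digest = hmac.new(MASKING_KEY, email.lower().encode(), hashlib.sha256)
    return f"user_{digest.hexdigest()[:16]}@masked.example"

def mask_row(row: dict, pii_fields: set) -> dict:
    """Mask only the declared PII columns; pass everything else through."""
    return {k: mask_email(v) if k in pii_fields else v
            for k, v in row.items()}
```

Running this as the only path by which data enters lower environments makes the compliance property structural rather than procedural, which matters with a four-person platform team.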