Context
Northstar Retail runs 120 batch and near-real-time data pipelines on AWS using Apache Airflow, dbt, Spark, and Snowflake. Today, DAGs and transformation code are deployed manually from developer laptops, which produces inconsistent environments and failed releases and leaves no reliable rollback path.
You need to design a CI/CD pipeline for the data platform so code changes can be validated, tested, deployed, and monitored safely across dev, staging, and production.
Scale Requirements
- Pipelines managed: 120 Airflow DAGs, 450 dbt models, 35 Spark jobs
- Deploy frequency: 20-30 merges/day from 8 contributing engineers
- Latency target: CI feedback in < 10 minutes for standard changes; production deploy < 15 minutes
- Data volume affected: ~14 TB/day across S3 and Snowflake
- Availability target: 99.9% successful scheduled runs after deployment
- Rollback target: Restore previous stable release in < 5 minutes
Requirements
- Design a CI pipeline that validates Python, SQL, DAG definitions, dbt models, and infrastructure changes on every pull request (a DAG validation sketch follows this list).
- Include automated unit, integration, and data quality tests before promotion to production (a staging smoke test is sketched below).
- Support environment-specific configuration and secrets management without hardcoding credentials (a Secrets Manager sketch follows).
- Ensure deployments are idempotent and prevent partial releases across Airflow, dbt, and Spark assets (a pointer-flip release sketch follows).
- Define a release strategy for dev, staging, and production, including approvals and rollback (the same pointer-flip sketch covers rollback).
- Add monitoring for deployment health, pipeline failures, and post-release data quality regressions (a CloudWatch metric sketch follows).
- Explain how you would handle schema changes, backfills, and breaking DAG updates safely (a chunked backfill sketch follows).
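For the pull-request gate on DAG definitions, a minimal pytest sketch using Airflow's DagBag is shown below. The dags/ folder path and the owner/retries conventions are assumptions for illustration, not part of the brief.

```python
# Minimal sketch of a CI gate that fails the pull request when any DAG
# cannot be imported or violates basic conventions. Assumes DAG files
# live under dags/ in the repository checkout.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # include_examples=False keeps Airflow's bundled example DAGs out of CI.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error or missing dependency surfaces here, before deploy.
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"


def test_every_dag_has_owner_and_retries(dag_bag):
    # Conventions worth enforcing at review time rather than in production.
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} is missing an owner"
        assert dag.default_args.get("retries", 0) >= 1, f"{dag_id} has no retries"
```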
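For the data quality gate before promotion, one option is a smoke test that runs against the staging database right after a deploy. The table, column, and warehouse names below are hypothetical, and credentials are read from environment variables injected by the CI runner, never hardcoded.

```python
# Hypothetical post-deploy smoke test against staging Snowflake. Table
# names (analytics.fct_orders) and thresholds are illustrative only.
import os
import snowflake.connector


def test_orders_freshness_and_nulls():
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="CI_WH",
        database="STAGING",
    )
    try:
        cur = conn.cursor()
        # Freshness: the model must have loaded data within the last day.
        cur.execute(
            "SELECT DATEDIFF('hour', MAX(loaded_at), CURRENT_TIMESTAMP()) "
            "FROM analytics.fct_orders"
        )
        staleness_hours = cur.fetchone()[0]
        assert staleness_hours is not None and staleness_hours < 24

        # Null-rate regression: the primary key must never be null.
        cur.execute("SELECT COUNT(*) FROM analytics.fct_orders WHERE order_id IS NULL")
        assert cur.fetchone()[0] == 0
    finally:
        conn.close()
```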
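For environment-specific secrets on this stack, a common pattern is to resolve credentials from AWS Secrets Manager at runtime so nothing sensitive lives in the repo or a DAG file. The northstar/&lt;env&gt;/snowflake naming convention is an assumption for illustration.

```python
# Sketch of environment-aware secret retrieval via AWS Secrets Manager.
# The secret name convention is hypothetical; Secrets Manager encrypts
# values at rest and boto3 fetches them over TLS, matching the PII constraint.
import json
import boto3


def get_snowflake_credentials(env: str) -> dict:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=f"northstar/{env}/snowflake")
    # SecretString holds a JSON document: {"user": ..., "password": ..., "account": ...}
    return json.loads(response["SecretString"])


# Usage inside a deployment script or Airflow connection setup:
# creds = get_snowflake_credentials("staging")
```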
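For idempotent, all-or-nothing releases and the &lt; 5 minute rollback target, one sketch is an immutable-bundle-plus-pointer scheme: every merge publishes a bundle keyed by git SHA, promotion is a single S3 pointer write, and rollback is the same write with the previous SHA. Bucket and key names below are assumptions.

```python
# Sketch of an atomic release: schedulers and jobs resolve pointers/<env>/current
# at start-up, so switching releases is one object write, never a partial copy.
import boto3

BUCKET = "northstar-data-releases"  # hypothetical bucket name
s3 = boto3.client("s3")


def publish_bundle(git_sha: str, bundle_path: str) -> None:
    # Re-uploading the same SHA has the same end state, keeping deploys idempotent.
    s3.upload_file(bundle_path, BUCKET, f"bundles/{git_sha}/release.tar.gz")


def promote(env: str, git_sha: str) -> None:
    # Record the outgoing release first so rollback always has a target.
    current = _read_pointer(env)
    if current:
        s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/previous", Body=current.encode())
    # The atomic switch: one write moves Airflow, dbt, and Spark together.
    s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/current", Body=git_sha.encode())


def rollback(env: str) -> None:
    previous = _read_pointer(env, "previous")
    if previous:
        s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/current", Body=previous.encode())


def _read_pointer(env: str, name: str = "current") -> str | None:
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"pointers/{env}/{name}")
        return obj["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        return None
```

Because the switch is a single object write, a failed deploy never leaves Airflow, dbt, and Spark assets on mixed versions, and the pointer history doubles as an audit trail for the SOX constraint.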
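For deployment-health monitoring, the deploy job can emit a custom CloudWatch metric that an alarm then watches against the 99.9% scheduled-run target. The namespace and dimension names here are illustrative.

```python
# Sketch of post-release health reporting: emit one data point per deploy,
# then alarm on the metric. Namespace/dimensions are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")


def report_deploy_health(env: str, release_sha: str, succeeded: bool) -> None:
    cloudwatch.put_metric_data(
        Namespace="Northstar/DataPlatform",
        MetricData=[
            {
                "MetricName": "DeploymentSuccess",
                "Dimensions": [
                    {"Name": "Environment", "Value": env},
                    {"Name": "Release", "Value": release_sha},
                ],
                "Value": 1.0 if succeeded else 0.0,
                "Unit": "Count",
            }
        ],
    )
```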
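For backfills after a schema change, a chunked driver keeps long reprocessing windows restartable: instead of one multi-week backfill that can fail halfway, each day-sized window runs to completion before the next starts. The sketch assumes the Airflow CLI is available on the host running it; the DAG id in the usage line is hypothetical.

```python
# Sketch of a chunked backfill driver: a failure stops at a clean window
# boundary, so rerunning the script resumes from the failed window.
import subprocess
from datetime import date, timedelta


def backfill_in_chunks(dag_id: str, start: date, end: date, chunk_days: int = 1) -> None:
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=chunk_days - 1), end)
        subprocess.run(
            [
                "airflow", "dags", "backfill",
                "--start-date", cursor.isoformat(),
                "--end-date", window_end.isoformat(),
                dag_id,
            ],
            check=True,  # stop at the first failed window
        )
        cursor = window_end + timedelta(days=1)


# Example: backfill_in_chunks("orders_daily", date(2024, 1, 1), date(2024, 1, 31))
```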
Constraints
- Existing stack must remain AWS + Snowflake; no full platform rewrite
- The team owning the platform is small: 3 data engineers and 1 platform engineer
- Monthly incremental tooling budget is capped at $12K
- Production changes must be auditable for SOX compliance
- Some pipelines process PII, so secrets and artifacts must be encrypted at rest and in transit