Context
FinLedge, a B2B payments company, runs 120 production data pipelines that ingest PostgreSQL CDC, S3 batch files, and Kafka events into Snowflake. Today, deployments are manual: Airflow DAGs, dbt models, and Spark jobs are pushed directly to production, causing broken dependencies, schema drift, and inconsistent rollback behavior.
You need to design a CI/CD pipeline for the data platform so engineering teams can safely test, deploy, and monitor changes to ETL/ELT workflows with minimal downtime.
Scale Requirements
- Pipelines: 120 production workflows, growing 10% per quarter
- Deploy frequency: 30-50 merges/day across DAGs, dbt, and Spark code
- Data volume: 8 TB/day batch + 150K events/sec streaming peak
- Latency targets: CI validation < 15 minutes; production deployment < 10 minutes; rollback < 5 minutes
- Environments: dev, staging, and prod, each in a separate AWS account
- Retention: CI artifacts and logs retained for 90 days
Requirements
- Design a CI/CD process for Airflow DAGs, dbt transformations, and Spark jobs using Git-based workflows.
- Validate code quality with unit tests, SQL tests, schema checks, and pipeline dependency validation before merge (see the validation sketch after this list).
- Support environment-specific configuration, secrets management, and promotion from dev to staging to prod (see the configuration sketch after this list).
- Ensure idempotent deployments, safe rollback, and versioned artifacts for DAGs, container images, and dbt manifests (see the deployment/rollback sketch after this list).
- Prevent bad releases from corrupting downstream tables or breaking scheduled jobs.
- Include monitoring for deployment failures, data quality regressions, and runtime health after release (see the health-check sketch after this list).
- Support both batch and streaming jobs without pausing critical financial reporting pipelines.
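A minimal sketch of the pre-merge validation step, assuming pytest-style execution on the CI runner with the Airflow and dbt packages installed; the `dags/` and `dbt_project/` paths and the `ci` dbt target are illustrative, not FinLedge's actual layout:
```python
"""Pre-merge CI checks: DAG import validation and dbt compilation.

Paths, project names, and the 'ci' target are assumptions for illustration.
"""
import subprocess
import sys

from airflow.models import DagBag


def validate_dags(dag_folder: str = "dags/") -> None:
    """Fail the build if any DAG file fails to import."""
    dag_bag = DagBag(dag_folder=dag_folder, include_examples=False)
    if dag_bag.import_errors:
        for path, error in dag_bag.import_errors.items():
            print(f"DAG import error in {path}: {error}", file=sys.stderr)
        sys.exit(1)


def validate_dbt(project_dir: str = "dbt_project/") -> None:
    """Compile the dbt project against a CI target to catch SQL and ref() errors."""
    result = subprocess.run(
        ["dbt", "compile", "--project-dir", project_dir, "--target", "ci"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(result.stdout + result.stderr, file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    validate_dags()
    validate_dbt()
```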
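One possible shape for environment-specific configuration, with promotion from dev to staging to prod driven by selecting a config rather than hand-editing settings; the account IDs, bucket names, and database names below are placeholders:
```python
"""Environment-specific deployment configuration (dev -> staging -> prod).

All values are placeholders, not FinLedge's real accounts, buckets, or databases.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    aws_account_id: str
    dag_bucket: str          # S3 bucket backing the MWAA environment
    snowflake_database: str  # target database for dbt models
    secrets_prefix: str      # AWS Secrets Manager path, resolved at runtime


ENVIRONMENTS = {
    "dev": EnvConfig("111111111111", "finledge-dags-dev", "ANALYTICS_DEV", "finledge/dev/"),
    "staging": EnvConfig("222222222222", "finledge-dags-staging", "ANALYTICS_STG", "finledge/staging/"),
    "prod": EnvConfig("333333333333", "finledge-dags-prod", "ANALYTICS", "finledge/prod/"),
}


def config_for(env: str) -> EnvConfig:
    """Look up the promotion target; unknown names fail fast in CI."""
    try:
        return ENVIRONMENTS[env]
    except KeyError:
        raise ValueError(f"Unknown environment: {env!r}") from None
```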
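A sketch of idempotent, versioned deployment: each release uploads artifacts under an immutable Git-SHA prefix and then flips a small pointer object, so rollback is a pointer change rather than a re-deploy. The bucket name and the `releases/<sha>/` key layout are assumptions for illustration:
```python
"""Versioned artifact deployment with pointer-based rollback.

The bucket name and key layout are assumptions; the idea is that production
only ever reads whichever release prefix the pointer object names.
"""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "finledge-dags-prod"          # placeholder bucket
POINTER_KEY = "releases/current.json"  # small object naming the active release


def publish_release(git_sha: str, local_artifacts: dict[str, str]) -> None:
    """Upload artifacts under an immutable per-release prefix.

    Idempotent: re-running for the same SHA overwrites identical objects.
    """
    for key, path in local_artifacts.items():
        s3.upload_file(path, BUCKET, f"releases/{git_sha}/{key}")


def activate_release(git_sha: str) -> None:
    """Point production at a release; rollback is the same call with a prior SHA."""
    body = json.dumps({"release": git_sha}).encode()
    s3.put_object(Bucket=BUCKET, Key=POINTER_KEY, Body=body)
```
Rollback then reduces to `activate_release("<previous-known-good-sha>")`, which fits comfortably inside the 5-minute rollback target.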
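A sketch of a post-release health check that measures table freshness in Snowflake and publishes a CloudWatch metric an alarm can key on; the connection parameters, `loaded_at` column, table names, and metric namespace are all assumptions:
```python
"""Post-deployment smoke check: Snowflake table freshness -> CloudWatch metric.

Connection parameters, the 'loaded_at' column, and the namespace are assumptions.
"""
import boto3
import snowflake.connector

cloudwatch = boto3.client("cloudwatch")


def check_freshness(conn_params: dict, table: str, max_lag_minutes: int = 60) -> bool:
    """Return True if the table has received rows within the allowed lag."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute(
            f"SELECT DATEDIFF('minute', MAX(loaded_at), CURRENT_TIMESTAMP()) FROM {table}"
        )
        lag = cur.fetchone()[0]
    finally:
        conn.close()

    # Emit the lag so deployment alarms and dashboards can track regressions.
    cloudwatch.put_metric_data(
        Namespace="FinLedge/Deployments",
        MetricData=[{
            "MetricName": "TableLagMinutes",
            "Value": float(lag or 0),
            "Dimensions": [{"Name": "Table", "Value": table}],
        }],
    )
    return lag is not None and lag <= max_lag_minutes
```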
Constraints
- AWS-first stack; existing services include MWAA, ECR, S3, EMR, and Snowflake
- Team has 5 data engineers and 1 platform engineer
- Monthly incremental tooling budget is capped at $15K
- SOX-style auditability is required for production changes
- Production secrets cannot be exposed in CI runners (see the runtime secret-resolution sketch below)
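One common pattern for keeping production secrets off CI runners is to let CI ship only artifacts, while running tasks resolve credentials from AWS Secrets Manager at execution time inside the production account; a minimal sketch, with the secret name and JSON shape as placeholders:
```python
"""Runtime secret resolution so production credentials never reach CI runners.

The secret name and JSON shape are placeholders; only the MWAA/EMR execution
role, not the CI role, would be granted secretsmanager:GetSecretValue here.
"""
import json

import boto3


def get_snowflake_credentials(secret_id: str = "finledge/prod/snowflake") -> dict:
    """Fetch credentials at task runtime inside the production account."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```
This also helps the SOX-style auditability constraint: secret access is logged by CloudTrail against the production execution role rather than scattered across CI jobs.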