Context
LedgerLoop, a fintech company, runs nightly and hourly ETL pipelines that ingest payment, ledger, and customer data from PostgreSQL and Kafka into Snowflake. A recent Airflow deployment introduced a dbt model change that duplicated transactions and broke downstream finance dashboards, so the team needs a rollback design that restores correct data quickly without losing records or reprocessing them incorrectly.
Scale Requirements
- Batch volume: 2.5 TB/day across 180 Airflow DAGs
- Streaming volume: 40K events/sec peak from Kafka topics
- Latency targets: hourly pipelines must recover within 15 minutes; streaming pipelines within 5 minutes
- Storage: 90-day raw retention in S3, 3-year curated retention in Snowflake
- Recovery objective: RTO < 20 minutes, RPO < 5 minutes for critical finance tables
Requirements
- Design a deployment strategy for Airflow DAGs, Spark jobs, and dbt models that supports fast rollback after a bad release.
- Ensure rollback covers both code and data, including partially processed batches and streaming checkpoints.
- Prevent duplicate loads during reruns using idempotent writes, deterministic batch IDs, and transactional merge patterns (see the merge sketch after this list).
- Define how to detect a bad deployment using data quality checks, pipeline health metrics, and downstream table validation (a validation sketch follows the list).
- Describe how orchestration should pause affected DAGs, revert to the previous artifact version, and safely resume dependencies (see the pause/rollback sketch below).
- Explain how to handle in-flight Kafka offsets, Spark checkpoints, and Snowflake tables during rollback (see the Time Travel and offset-reset sketch below).
- Include a plan for replay/backfill of corrupted windows after rollback is complete.
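A minimal sketch of the downstream-validation idea, assuming a curated table analytics.finance.fct_transactions keyed by transaction_id (both names are hypothetical). The intent is that a post-deploy Airflow task runs this check and fails loudly so a bad release is flagged within minutes.

```python
def duplicate_transaction_count(cursor) -> int:
    """Count business keys that appear more than once in the curated table.

    `cursor` is any DB-API cursor connected to Snowflake (for example one
    obtained from snowflake.connector); passing it in keeps the check testable.
    """
    cursor.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT transaction_id
            FROM analytics.finance.fct_transactions
            GROUP BY transaction_id
            HAVING COUNT(*) > 1
        )
        """
    )
    (dupes,) = cursor.fetchone()
    return int(dupes)


def validate_release(cursor) -> None:
    """Raise so the Airflow task fails and alerting marks the deployment as bad."""
    dupes = duplicate_transaction_count(cursor)
    if dupes:
        raise ValueError(
            f"{dupes} duplicated transaction_ids in fct_transactions; "
            "treat the latest release as bad and start rollback"
        )
```

Similar gates (row-count deltas versus the previous run, sum-of-amounts reconciliation against the ledger source) can share the same task.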
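A sketch of the pause-and-revert step, assuming Airflow 2.x with the stable REST API and a basic-auth backend enabled. The URL, service account, and DAG ids are placeholders, and the artifact revert itself is assumed to go through the existing CI pipeline (redeploying the previous Git tag of the DAG and dbt code), which also keeps the SOX-auditable deployment history intact.

```python
import requests

AIRFLOW_URL = "https://airflow.internal.example.com"            # placeholder
AUTH = ("svc_deploy", "********")                                # placeholder
AFFECTED_DAGS = ["finance_hourly_load", "finance_dbt_build"]     # placeholder

def set_paused(dag_id: str, paused: bool) -> None:
    """Flip a DAG's paused flag via PATCH /api/v1/dags/{dag_id}."""
    resp = requests.patch(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}",
        json={"is_paused": paused},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # 1. Pause the blast radius before touching code or data.
    for dag_id in AFFECTED_DAGS:
        set_paused(dag_id, True)
    # 2. Redeploy the previous Git tag of the DAGs and dbt project via CI.
    # 3. Unpause and replay the corrupted window, e.g. with the CLI:
    #      airflow dags backfill --start-date 2024-05-01 --end-date 2024-05-02 finance_hourly_load
    #    (dates and DAG id are illustrative).
```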
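For the data side, one common pattern is Snowflake Time Travel: clone the table as it existed before the bad release, validate the clone, then swap it in atomically. The sketch below assumes the table name, timestamp, and retention window are illustrative, and the streaming notes describe the usual options rather than a prescribed procedure; running the swap under a change ticket keeps the SOX data-correction log auditable.

```python
PRE_DEPLOY_TS = "2024-05-01 02:00:00 -0700"            # assumed last-known-good point
TABLE = "analytics.finance.fct_transactions"            # hypothetical table
RESTORE = f"{TABLE}_restore"

ROLLBACK_SQL = [
    # 1. Clone the table as it existed before the bad release (Time Travel).
    f"CREATE OR REPLACE TABLE {RESTORE} CLONE {TABLE} "
    f"AT (TIMESTAMP => '{PRE_DEPLOY_TS}'::timestamp_tz)",
    # 2. Swap the validated clone in atomically.
    f"ALTER TABLE {TABLE} SWAP WITH {RESTORE}",
]

def run_data_rollback(cursor) -> None:
    """Execute the Time Travel rollback with any Snowflake DB-API cursor."""
    for stmt in ROLLBACK_SQL:
        cursor.execute(stmt)

# Streaming side (exact values depend on the incident, so comments only):
# - Consumer-group loaders can be rewound in place, e.g.
#     kafka-consumer-groups.sh --bootstrap-server <brokers> \
#       --group finance-loader --topic payments \
#       --reset-offsets --to-datetime 2024-05-01T09:00:00.000 --execute
# - Spark Structured Streaming keeps its Kafka offsets in the checkpoint, so
#   rolling back usually means restarting the job with a fresh checkpoint
#   location and an explicit starting offset/timestamp, then letting the
#   idempotent merge absorb the replayed window.
```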
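To keep reruns idempotent, one possible sketch (under hypothetical table names and a pyformat/format-style cursor such as snowflake.connector's default): derive the batch ID deterministically from the logical window, stage the window's rows tagged with that ID, and MERGE on the business key so a replay updates rather than duplicates.

```python
import hashlib

def batch_id(dag_id: str, window_start: str, window_end: str) -> str:
    """Same logical window always yields the same id, so retries reuse it."""
    key = f"{dag_id}|{window_start}|{window_end}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# Hypothetical schema: curated fact table plus a staging table carrying batch_id.
MERGE_SQL = """
MERGE INTO analytics.finance.fct_transactions AS t
USING staging.finance.transactions_batch AS s
  ON t.transaction_id = s.transaction_id
WHEN MATCHED THEN UPDATE SET
  t.amount    = s.amount,
  t.status    = s.status,
  t.batch_id  = s.batch_id,
  t.loaded_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (transaction_id, amount, status, batch_id, loaded_at)
  VALUES (s.transaction_id, s.amount, s.status, s.batch_id, CURRENT_TIMESTAMP())
"""

def load_window(cursor, dag_id: str, window_start: str, window_end: str) -> None:
    """Idempotent load: clear any prior attempt for this batch id, restage, merge."""
    bid = batch_id(dag_id, window_start, window_end)
    cursor.execute(
        "DELETE FROM staging.finance.transactions_batch WHERE batch_id = %s", (bid,)
    )
    # ... COPY/INSERT the window's rows into staging with batch_id = bid ...
    cursor.execute(MERGE_SQL)
```

Because the MERGE is keyed on transaction_id, replaying a corrupted window after rollback converges to the same final state no matter how many times it runs.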
Constraints
- AWS-first stack; no migration to a new orchestrator
- Small platform team: 3 data engineers, 1 SRE
- SOX compliance requires auditable deployment history and data correction logs
- Monthly incremental infrastructure budget capped at $18K
- No more than 10 minutes of dashboard unavailability for finance stakeholders