Context
LedgerLoop, a fintech company, runs nightly and hourly ETL pipelines that ingest payment, ledger, and customer data from PostgreSQL and Kafka into Snowflake. A recent Airflow deployment introduced a dbt model change that duplicated transactions and broke downstream finance dashboards, so the team needs a rollback design that restores correct data quickly without losing records or reprocessing them incorrectly.
Scale Requirements
- Batch volume: 2.5 TB/day across 180 Airflow DAGs
- Streaming volume: 40K events/sec peak from Kafka topics
- Latency targets: hourly pipelines must recover within 15 minutes; streaming pipelines within 5 minutes
- Storage: 90-day raw retention in S3, 3-year curated retention in Snowflake
- Recovery objective: RTO < 20 minutes, RPO < 5 minutes for critical finance tables
Requirements
- Design a deployment strategy for Airflow DAGs, Spark jobs, and dbt models that supports fast rollback after a bad release.
- Ensure rollback covers both code and data, including partially processed batches and streaming checkpoints.
- Prevent duplicate loads during reruns using idempotent writes, deterministic batch IDs, and transactional merge patterns (see the merge sketch after this list).
- Define how to detect a bad deployment using data quality checks, pipeline health metrics, and downstream table validation (a validation sketch follows the list).
- Describe how orchestration should pause affected DAGs, revert to the previous artifact version, and safely resume dependencies (see the pause/rollback sketch below).
- Explain how to handle in-flight Kafka offsets, Spark checkpoints, and Snowflake tables during rollback (see the Time Travel and offset-reset sketch below).
- Include a plan for replay/backfill of corrupted windows after rollback is complete.
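A minimal sketch of the downstream-validation idea, assuming a curated table analytics.finance.fct_transactions keyed by transaction_id (both names are hypothetical). The intent is that a post-deploy Airflow task runs this check and fails loudly so a bad release is flagged within minutes.

```python
def duplicate_transaction_count(cursor) -> int:
    """Count business keys that appear more than once in the curated table.

    `cursor` is any DB-API cursor connected to Snowflake (for example one
    obtained from snowflake.connector); passing it in keeps the check testable.
    """
    cursor.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT transaction_id
            FROM analytics.finance.fct_transactions
            GROUP BY transaction_id
            HAVING COUNT(*) > 1
        )
        """
    )
    (dupes,) = cursor.fetchone()
    return int(dupes)


def validate_release(cursor) -> None:
    """Raise so the Airflow task fails and alerting marks the deployment as bad."""
    dupes = duplicate_transaction_count(cursor)
    if dupes:
        raise ValueError(
            f"{dupes} duplicated transaction_ids in fct_transactions; "
            "treat the latest release as bad and start rollback"
        )
```

Similar gates (row-count deltas versus the previous run, sum-of-amounts reconciliation against the ledger source) can share the same task.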
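A sketch of the pause-and-revert step, assuming Airflow 2.x with the stable REST API and a basic-auth backend enabled. The URL, service account, and DAG ids are placeholders, and the artifact revert itself is assumed to go through the existing CI pipeline (redeploying the previous Git tag of the DAG and dbt code), which also keeps the SOX-auditable deployment history intact.

```python
import requests

AIRFLOW_URL = "https://airflow.internal.example.com"            # placeholder
AUTH = ("svc_deploy", "********")                                # placeholder
AFFECTED_DAGS = ["finance_hourly_load", "finance_dbt_build"]     # placeholder

def set_paused(dag_id: str, paused: bool) -> None:
    """Flip a DAG's paused flag via PATCH /api/v1/dags/{dag_id}."""
    resp = requests.patch(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}",
        json={"is_paused": paused},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # 1. Pause the blast radius before touching code or data.
    for dag_id in AFFECTED_DAGS:
        set_paused(dag_id, True)
    # 2. Redeploy the previous Git tag of the DAGs and dbt project via CI.
    # 3. Unpause and replay the corrupted window, e.g. with the CLI:
    #      airflow dags backfill --start-date 2024-05-01 --end-date 2024-05-02 finance_hourly_load
    #    (dates and DAG id are illustrative).
```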
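For the data side, one common pattern is Snowflake Time Travel: clone the table as it existed before the bad release, validate the clone, then swap it in atomically. The sketch below assumes the table name, timestamp, and retention window are illustrative, and the streaming notes describe the usual options rather than a prescribed procedure; running the swap under a change ticket keeps the SOX data-correction log auditable.

```python
PRE_DEPLOY_TS = "2024-05-01 02:00:00 -0700"            # assumed last-known-good point
TABLE = "analytics.finance.fct_transactions"            # hypothetical table
RESTORE = f"{TABLE}_restore"

ROLLBACK_SQL = [
    # 1. Clone the table as it existed before the bad release (Time Travel).
    f"CREATE OR REPLACE TABLE {RESTORE} CLONE {TABLE} "
    f"AT (TIMESTAMP => '{PRE_DEPLOY_TS}'::timestamp_tz)",
    # 2. Swap the validated clone in atomically.
    f"ALTER TABLE {TABLE} SWAP WITH {RESTORE}",
]

def run_data_rollback(cursor) -> None:
    """Execute the Time Travel rollback with any Snowflake DB-API cursor."""
    for stmt in ROLLBACK_SQL:
        cursor.execute(stmt)

# Streaming side (exact values depend on the incident, so comments only):
# - Consumer-group loaders can be rewound in place, e.g.
#     kafka-consumer-groups.sh --bootstrap-server <brokers> \
#       --group finance-loader --topic payments \
#       --reset-offsets --to-datetime 2024-05-01T09:00:00.000 --execute
# - Spark Structured Streaming keeps its Kafka offsets in the checkpoint, so
#   rolling back usually means restarting the job with a fresh checkpoint
#   location and an explicit starting offset/timestamp, then letting the
#   idempotent merge absorb the replayed window.
```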
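To keep reruns idempotent, one possible sketch (under hypothetical table names and a pyformat/format-style cursor such as snowflake.connector's default): derive the batch ID deterministically from the logical window, stage the window's rows tagged with that ID, and MERGE on the business key so a replay updates rather than duplicates.

```python
import hashlib

def batch_id(dag_id: str, window_start: str, window_end: str) -> str:
    """Same logical window always yields the same id, so retries reuse it."""
    key = f"{dag_id}|{window_start}|{window_end}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# Hypothetical schema: curated fact table plus a staging table carrying batch_id.
MERGE_SQL = """
MERGE INTO analytics.finance.fct_transactions AS t
USING staging.finance.transactions_batch AS s
  ON t.transaction_id = s.transaction_id
WHEN MATCHED THEN UPDATE SET
  t.amount    = s.amount,
  t.status    = s.status,
  t.batch_id  = s.batch_id,
  t.loaded_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (transaction_id, amount, status, batch_id, loaded_at)
  VALUES (s.transaction_id, s.amount, s.status, s.batch_id, CURRENT_TIMESTAMP())
"""

def load_window(cursor, dag_id: str, window_start: str, window_end: str) -> None:
    """Idempotent load: clear any prior attempt for this batch id, restage, merge."""
    bid = batch_id(dag_id, window_start, window_end)
    cursor.execute(
        "DELETE FROM staging.finance.transactions_batch WHERE batch_id = %s", (bid,)
    )
    # ... COPY/INSERT the window's rows into staging with batch_id = bid ...
    cursor.execute(MERGE_SQL)
```

Because the MERGE is keyed on transaction_id, replaying a corrupted window after rollback converges to the same final state no matter how many times it runs.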
Constraints
- AWS-first stack; no migration to a new orchestrator
- Small platform team: 3 data engineers, 1 SRE
- SOX compliance requires auditable deployment history and data correction logs
- Monthly incremental infrastructure budget capped at $18K
- No more than 10 minutes of dashboard unavailability for finance stakeholders