Context
A Databricks customer runs a production Delta Live Tables pipeline that ingests order, inventory, and fulfillment events into Delta tables used by finance and operations dashboards. Releases are currently done in-place, causing occasional schema regressions, duplicate writes, and 10-20 minutes of downstream instability during upgrades.
Design a blue/green deployment strategy for this data pipeline on Databricks that enables zero-downtime cutover between two parallel pipeline environments while preserving data correctness and rollback safety.
Scale Requirements
- Ingestion rate: 120K events/sec peak, 35K events/sec average
- Sources: Kafka topics for CDC/events, daily batch reference files in cloud object storage
- Data volume: ~9 TB/day raw, ~2.5 TB/day curated Delta output
- Latency target: streaming tables queryable within 90 seconds end-to-end
- Availability target: 99.95% for production data products
- Retention: 180 days raw Bronze, 2 years Silver/Gold
Requirements
- Design separate blue and green Databricks pipeline environments, including compute, checkpoints, Unity Catalog objects, and deployment automation.
- Explain how both environments can process the same upstream data safely without duplicate publication to production consumers.
- Define the cutover mechanism for downstream readers with zero downtime, using Databricks-native surfaces where possible.
- Include validation gates before promotion: schema compatibility, row-count reconciliation, freshness SLA, and data quality checks.
- Describe rollback behavior if the green pipeline passes initial checks but later shows correctness or latency issues.
- Address stateful streaming concerns such as checkpoint isolation, exactly-once semantics, idempotent writes, and late-arriving events.
- Specify monitoring, alerting, and deployment orchestration for routine releases and emergency rollback.
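One Databricks-native way to satisfy the zero-downtime cutover requirement is to have all BI consumers read stable views in a production schema, with promotion atomically repointing those views at the blue or green environment's Gold tables (`CREATE OR REPLACE VIEW` is atomic in Unity Catalog). The sketch below only builds the SQL statements the deployment job would execute via `spark.sql`; the catalog, schema, and table names are hypothetical placeholders, not part of the original spec.

```python
# Sketch of a view-swap cutover: consumers query stable views (e.g.
# prod.sales.orders) that are repointed to the promoted environment.
# All names below are hypothetical placeholders.

TABLES = ["orders", "inventory", "fulfillment"]

def cutover_statements(target_env: str, prod_schema: str = "prod.sales") -> list[str]:
    """Build the CREATE OR REPLACE VIEW statements that repoint every
    consumer-facing view at the blue or green environment's Gold tables.
    Each statement is atomic, so readers never see a missing table and
    no BI job needs manual rewiring."""
    if target_env not in ("blue", "green"):
        raise ValueError(f"unknown environment: {target_env}")
    env_schema = f"prod_{target_env}.sales"  # assumed naming convention
    return [
        f"CREATE OR REPLACE VIEW {prod_schema}.{t} AS "
        f"SELECT * FROM {env_schema}.{t}"
        for t in TABLES
    ]

# Promoting green emits one statement per Gold table, executed in order
# by the deployment job (spark.sql(stmt)); rollback is the same call
# with target_env="blue".
for stmt in cutover_statements("green"):
    print(stmt)
```

Because rollback is just the inverse view swap, the same mechanism covers the emergency-rollback path without touching any data.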
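The promotion gates listed above can be expressed as a single pass/fail check the CI/CD job runs before cutover. This is a minimal sketch under stated assumptions: the `TableStats` shape, tolerance default, and field names are hypothetical; in practice the metrics would come from Delta table metadata, DLT expectation results, and a max-event-timestamp query on each environment.

```python
from dataclasses import dataclass
import datetime as dt

@dataclass
class TableStats:
    # Hypothetical per-table metrics collected from the blue and green runs
    row_count: int
    schema: dict            # column name -> data type
    max_event_ts: dt.datetime
    dq_failed_rows: int     # rows quarantined by quality expectations

def promotion_gate(blue: TableStats, green: TableStats,
                   now: dt.datetime,
                   count_tolerance: float = 0.001,
                   freshness_sla_s: int = 90) -> list[str]:
    """Return a list of gate failures; an empty list means safe to promote."""
    failures = []
    # 1. Schema compatibility: green may add columns, never drop or retype.
    for col, typ in blue.schema.items():
        if green.schema.get(col) != typ:
            failures.append(f"schema: column {col!r} missing or retyped")
    # 2. Row-count reconciliation within tolerance.
    if blue.row_count and abs(green.row_count - blue.row_count) / blue.row_count > count_tolerance:
        failures.append(f"row count drift: blue={blue.row_count} green={green.row_count}")
    # 3. Freshness SLA (the 90-second end-to-end target).
    if (now - green.max_event_ts).total_seconds() > freshness_sla_s:
        failures.append("freshness: green exceeds latency target")
    # 4. Data quality expectations must be clean.
    if green.dq_failed_rows > 0:
        failures.append(f"data quality: {green.dq_failed_rows} rows failed expectations")
    return failures
```

Running the gate per Gold table and aborting promotion on any non-empty result keeps the check auditable: the failure list itself becomes the audit artifact for the blocked release.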
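For the stateful-streaming requirement, the usual Databricks pattern is to give blue and green fully separate checkpoint paths and make the sink write idempotent, typically a Delta `MERGE` keyed on the event id inside `foreachBatch`, so that replaying a micro-batch after a checkpoint restart cannot produce duplicates. The sketch below models that upsert logic in plain Python (a dict stands in for the Delta table) purely to show the invariant; field names like `event_id` and `ts` are illustrative assumptions.

```python
def apply_batch(table: dict, batch: list[dict]) -> dict:
    """Idempotent upsert keyed on event_id: replaying the same batch
    (e.g. after a streaming checkpoint restart) leaves the table state
    unchanged, and a late-arriving event only wins if its timestamp is
    at least as new as the stored version."""
    for ev in batch:
        cur = table.get(ev["event_id"])
        if cur is None or ev["ts"] >= cur["ts"]:
            table[ev["event_id"]] = ev
    return table
```

The same keyed-merge invariant is what lets blue and green consume identical Kafka offsets in parallel without double publication: only whichever environment the production views point at is visible to consumers, and its writes are replay-safe.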
Constraints
- Must run primarily on Databricks: Delta Live Tables or Lakeflow Declarative Pipelines, Workflows, Unity Catalog, Delta Lake, and Structured Streaming.
- No consumer-visible downtime and no manual table rewiring across dozens of BI jobs.
- Budget allows temporary double-compute during deployment windows only.
- SOX controls require auditable promotion steps and reproducible deployments via Git-backed CI/CD.
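The SOX constraint implies the CI/CD pipeline should emit a structured, append-only record of every promotion step tied to a Git commit, and refuse cutover until all prerequisite steps have passed in order. This is a sketch of that control, with assumed step names and a hypothetical class; in practice the records would land in a Delta audit table written by the Workflows job.

```python
import datetime as dt

# Assumed promotion sequence; cutover is only legal after the rest pass.
REQUIRED_STEPS = ["deploy_green", "validation_gate", "approval", "cutover"]

class PromotionAudit:
    """Append-only record of promotion steps for one release.
    Names and fields are illustrative, not a Databricks API."""
    def __init__(self, release: str, git_sha: str):
        self.release, self.git_sha = release, git_sha
        self.steps = []

    def record(self, step: str, actor: str, outcome: str) -> None:
        self.steps.append({
            "step": step, "actor": actor, "outcome": outcome,
            "git_sha": self.git_sha,
            "ts": dt.datetime.now(dt.timezone.utc).isoformat(),
        })

    def ready_for_cutover(self) -> bool:
        # Every pre-cutover step must have passed, in the required order.
        passed = [s["step"] for s in self.steps if s["outcome"] == "pass"]
        prereq = REQUIRED_STEPS[:-1]
        return [s for s in passed if s in prereq] == prereq
```

Tying each record to the `git_sha` of the deployed bundle is what makes the deployment reproducible: auditors can map any production state back to the exact commit that produced it.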