Context
Databricks runs multiple production data pipelines using Databricks Workflows, Delta Live Tables, and dbt on Databricks. Today, deployments of pipeline code and configuration are handled through CI/CD, but observability is fragmented: engineers can see whether a deployment succeeded, yet they cannot easily trace which release introduced data quality regressions, latency spikes, or failed downstream tasks.
You are asked to design a deployment process where observability is built in by default. The goal is to make every deployment of a batch or streaming pipeline measurable, auditable, and easy to roll back.
Scale Requirements
- Pipelines: 250 pipelines, each deployed to dev, staging, and prod
- Deployments: 400-600 deployments per day across environments
- Workloads: 70% batch, 30% Structured Streaming / Delta Live Tables
- Latency target: deployment health visible within 2 minutes of rollout
- Log/metric retention: 90 days hot, 1 year archived
- Data volume: ~15 TB/day processed by deployed pipelines
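As a rough sanity check on these figures, the sketch below converts the daily numbers into average rates. It assumes load is spread evenly over 24 hours (real deployment traffic is likely bursty around working hours) and uses decimal units (1 TB = 1,000,000 MB); both are assumptions, not part of the stated requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_hour(per_day: float) -> float:
    """Average hourly rate, assuming an even spread over the day."""
    return per_day / 24


def avg_throughput_mb_per_s(tb_per_day: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 TB = 1,000,000 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


low, high = per_hour(400), per_hour(600)
throughput = avg_throughput_mb_per_s(15)

print(f"deployments/hour: {low:.1f} to {high:.1f}")  # deployments/hour: 16.7 to 25.0
print(f"avg throughput: {throughput:.0f} MB/s")      # avg throughput: 174 MB/s
```

At roughly 17 to 25 deployments per hour on average, the 2-minute visibility target has to hold while several rollouts are in flight at once, which argues for event-driven health signals rather than slow periodic polling.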
Requirements
- Design a CI/CD deployment pipeline for Databricks Asset Bundles, Databricks Workflows, and Delta Live Tables that emits deployment events, metrics, and traces at each stage.
- Correlate a deployment version with pipeline runs, cluster/job configuration, Unity Catalog tables touched, and data quality outcomes.
- Define pre-deployment checks such as bundle validation, schema compatibility, expectations/tests, and policy enforcement.
- Define post-deployment verification for canary runs, SLA checks, freshness, row-count drift, and error-budget-based rollback.
- Ensure deployments are idempotent and support safe retries, partial failures, and environment promotion.
- Propose dashboards, alerts, and audit trails using Databricks-native capabilities where possible.
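One possible way to approach the correlation, idempotency, and rollback requirements above is sketched here in plain Python. Everything in it is hypothetical: the `DeploymentEvent` fields, the `deployment_id` derivation, and the `should_roll_back` and `row_count_drift` helpers are illustrative names, not Databricks APIs. The idea is that a deployment ID derived deterministically from bundle, target, and commit stays the same across retries, so repeated events can be deduplicated and joined to pipeline runs downstream.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeploymentEvent:
    """Hypothetical event emitted at each CI/CD stage."""
    bundle: str    # bundle name
    target: str    # e.g. "dev", "staging", "prod"
    git_sha: str   # commit being deployed
    stage: str     # e.g. "validate", "deploy", "canary"
    status: str    # e.g. "started", "succeeded", "failed"

    @property
    def deployment_id(self) -> str:
        # Deterministic: a retry of the same bundle/target/commit yields the same ID.
        key = f"{self.bundle}|{self.target}|{self.git_sha}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def to_json(self) -> str:
        return json.dumps({"deployment_id": self.deployment_id, **asdict(self)},
                          sort_keys=True)


def should_roll_back(failed_runs: int, total_runs: int, error_budget: float) -> bool:
    """True when the observed failure rate exceeds the allowed error budget."""
    if total_runs == 0:
        return False  # no post-deployment evidence yet
    return failed_runs / total_runs > error_budget


def row_count_drift(baseline: int, current: int) -> float:
    """Relative change in row count versus the pre-deployment baseline."""
    if baseline == 0:
        return float("inf") if current else 0.0
    return abs(current - baseline) / baseline


validate = DeploymentEvent("orders_etl", "prod", "abc1234", "validate", "succeeded")
canary = DeploymentEvent("orders_etl", "prod", "abc1234", "canary", "failed")
print(validate.to_json())
print(validate.deployment_id == canary.deployment_id)   # True: same release, different stages
print(should_roll_back(failed_runs=2, total_runs=10, error_budget=0.1))  # True
```

In a real design the JSON records would likely land in a Delta table under Unity Catalog and be joined to job-run data by `deployment_id`; how that ID is attached to runs (for example, through job tags) is left open here.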
Constraints
- Primary platform is Databricks on AWS with Unity Catalog enabled
- Prefer Databricks Workflows, Lakehouse Monitoring, system tables, and Databricks SQL over generic external tools unless an external tool is clearly justified
- Team has 5 platform engineers and limited tolerance for custom control-plane services
- Must support SOX-style auditability and least-privilege deployment identities
- Incremental monthly observability budget is capped at $20K