Context
Databricks wants a standardized deployment workflow for lakehouse pipelines across multiple workspaces (dev, staging, prod). Today, teams manually promote Databricks Asset Bundles, Delta Live Tables pipelines, Databricks Jobs, and Unity Catalog objects between workspaces, which causes configuration drift, inconsistent approvals, and failed releases.
You are asked to design a CI/CD architecture using GitHub Actions as the primary control plane, with deployment targets in Databricks on AWS. The workflow must support both batch and streaming data pipelines and provide safe promotion of code, configuration, and infrastructure changes.
Scale Requirements
- Repositories: 120+ data platform repos
- Deployments: ~300 pipeline deploys/day across environments
- Pipelines: 800+ Databricks Jobs and 150+ Delta Live Tables / Lakeflow Declarative Pipelines
- Latency: PR validation < 15 minutes; production deployment < 10 minutes
- Artifacts: Python wheels, SQL files, bundle configs, Terraform plans
- Availability target: 99.9% of deployment workflow runs complete successfully
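To make the scale figures above concrete, the following is a minimal back-of-the-envelope sketch. The deploy rate and the 99.9% target come from the list above; the 30-day measurement window and the function names are assumptions chosen for illustration only.

```python
# Figures taken from the scale requirements above.
DEPLOYS_PER_DAY = 300
SUCCESS_TARGET = 0.999

# Assumption: the target is measured over a rolling 30-day window.
WINDOW_DAYS = 30


def failure_budget(deploys_per_day: int, success_target: float, window_days: int) -> float:
    """Return how many deployment workflow runs may fail within the window."""
    total_runs = deploys_per_day * window_days
    return total_runs * (1.0 - success_target)


def average_deploys_per_hour(deploys_per_day: int) -> float:
    """Return the mean hourly deploy rate, assuming deploys are spread evenly."""
    return deploys_per_day / 24


if __name__ == "__main__":
    budget = failure_budget(DEPLOYS_PER_DAY, SUCCESS_TARGET, WINDOW_DAYS)
    print(f"Allowed failed runs per {WINDOW_DAYS} days: {budget:.1f}")
    print(f"Average deploys per hour: {average_deploys_per_hour(DEPLOYS_PER_DAY):.1f}")
```

Under these assumptions the budget is roughly nine failed runs per 30 days, which suggests that flaky runners or transient API errors would likely need automatic retries rather than manual reruns. Real deploy traffic is usually bursty, so the hourly average is a lower bound for sizing, not a peak estimate.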
Requirements
- Design a GitHub Actions workflow that validates, packages, and deploys Databricks Asset Bundles to dev, staging, and prod workspaces.
- Include promotion controls for notebooks, Python code, Delta Live Tables/Lakeflow pipelines, Databricks Jobs, and Unity Catalog permissions.
- Define how secrets and authentication should work using GitHub OIDC or service principals without long-lived credentials.
- Add automated quality gates: unit tests, integration tests against ephemeral or shared test clusters, bundle validation, and post-deploy smoke tests.
- Support rollback for failed releases, including reverting job definitions and pipeline configs.
- Explain how to manage environment-specific configuration, dependency ordering, and concurrent deployments safely.
- Describe monitoring, alerting, and auditability for both CI failures and runtime deployment failures.
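One possible way to approach the dependency-ordering and concurrent-deployment requirements above is sketched below. It uses Python's standard `graphlib.TopologicalSorter` to group deploy units into waves that can run in order. The unit names, the dependency graph, and the `concurrency_key` naming scheme are all hypothetical; a real design would derive them from the bundle definitions.

```python
from graphlib import TopologicalSorter

# Hypothetical deploy units. Each key depends on the units in its value set,
# so grants and the wheel must exist before the pipeline that uses them.
DEPENDENCIES = {
    "unity_catalog_grants": set(),
    "python_wheel": set(),
    "lakeflow_pipeline": {"unity_catalog_grants", "python_wheel"},
    "databricks_job": {"lakeflow_pipeline"},
    "smoke_test": {"databricks_job"},
}


def deployment_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group units into waves; units in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


def concurrency_key(bundle: str, target: str) -> str:
    """Build a lock name so that only one deploy per bundle and target runs at a time.

    Assumption: the CI system serializes runs that share this key, for example
    by using it as a GitHub Actions concurrency group.
    """
    return f"deploy-{bundle}-{target}"


if __name__ == "__main__":
    for number, wave in enumerate(deployment_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {wave}")
    print(concurrency_key("orders_bundle", "prod"))
```

Keying the lock on both bundle and target lets unrelated bundles deploy in parallel while preventing two runs from racing on the same bundle in the same workspace. Whether permissions should always precede pipelines is a design decision the answer should justify, not something this sketch settles.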
Constraints
- Must use Databricks-native deployment surfaces where possible
- Production changes require approval and a full audit trail
- No manual edits in production workspaces
- Budget rules out large always-on test infrastructure
- Must satisfy SOX-style change management and least-privilege access requirements
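The approval and no-manual-edit constraints above could be enforced by a small policy check that runs before any deploy step. The sketch below is illustrative only: the `DeployRequest` fields, the actor labels, the strict dev-to-staging-to-prod ordering, and the default of one required approval are assumptions, not part of the stated constraints.

```python
from dataclasses import dataclass

# Assumption: promotion is strictly ordered dev -> staging -> prod.
PROMOTION_ORDER = ("dev", "staging", "prod")


@dataclass(frozen=True)
class DeployRequest:
    target: str                 # workspace the run wants to deploy to
    actor_type: str             # hypothetical labels: "service_principal" or "user"
    approvals: int              # number of recorded approvals for this change
    deployed_to: frozenset = frozenset()  # targets where this commit already succeeded


def policy_violations(request: DeployRequest, required_prod_approvals: int = 1) -> list[str]:
    """Return the reasons a deploy must be blocked; an empty list means it may proceed."""
    if request.target not in PROMOTION_ORDER:
        return [f"unknown target: {request.target}"]

    violations = []
    # The commit must already have succeeded in every earlier environment.
    position = PROMOTION_ORDER.index(request.target)
    for earlier in PROMOTION_ORDER[:position]:
        if earlier not in request.deployed_to:
            violations.append(f"not yet deployed to {earlier}")

    if request.target == "prod":
        # No manual edits in production: only automation identities may deploy.
        if request.actor_type != "service_principal":
            violations.append("prod deploys must run as a service principal")
        if request.approvals < required_prod_approvals:
            violations.append("prod deploy lacks required approval")
    return violations


if __name__ == "__main__":
    request = DeployRequest("prod", "user", 0, frozenset({"dev"}))
    for reason in policy_violations(request):
        print("BLOCKED:", reason)
```

Returning every violation instead of stopping at the first one gives the audit trail a complete record of why a release was refused, which may help with SOX-style evidence collection.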