Context
A Databricks customer runs 40+ production data pipelines for finance, product analytics, and operational reporting. Today, jobs, clusters, Delta Live Tables logic, and Unity Catalog permissions are deployed manually across dev, staging, and prod workspaces, causing configuration drift, failed releases, and slow recovery during incidents.
You are asked to redesign the deployment and operations model so pipeline infrastructure is provisioned and promoted through automation, with strong operational controls and minimal manual intervention.
Scale Requirements
- Pipelines: 40 batch pipelines today, expected to grow to 120 within 12 months
- Runs: ~2,500 Databricks Job runs/day across environments
- Data volume: 18 TB/day ingested from SaaS APIs, object storage, and operational databases
- Latency: critical batch SLAs of 15 minutes for hourly jobs and 2 hours for daily finance jobs
- Environments: 3 isolated Databricks workspaces (dev, staging, prod)
- Recovery targets: RPO < 15 minutes, RTO < 30 minutes for tier-1 pipelines
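To make the scale figures above easier to reason about, the following back-of-envelope sketch converts them into average rates. It rests on assumptions that are not part of the requirements: decimal terabytes (1 TB = 10**12 bytes), load spread evenly over 24 hours, and job runs growing linearly with pipeline count. Real traffic is likely to be burstier than these averages suggest.

```python
# Back-of-envelope rates derived from the stated scale requirements.
# Assumptions (not in the requirements): decimal TB, evenly spread load,
# and job runs that grow linearly with the number of pipelines.

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

ingest_bytes_per_day = 18 * 10**12   # 18 TB/day ingested
runs_per_day = 2_500                 # ~2,500 job runs/day across environments
pipelines_today = 40
pipelines_in_12_months = 120

# Average sustained ingest rate in MB/s.
avg_ingest_mb_per_s = ingest_bytes_per_day / SECONDS_PER_DAY / 10**6

# Average job run starts per minute.
avg_runs_per_minute = runs_per_day / MINUTES_PER_DAY

# Pipeline growth factor and the run volume it implies under linear scaling.
growth_factor = pipelines_in_12_months / pipelines_today
projected_runs_per_day = runs_per_day * growth_factor

print(f"Average ingest rate: {avg_ingest_mb_per_s:.1f} MB/s")
print(f"Average run starts:  {avg_runs_per_minute:.2f} per minute")
print(f"Pipeline growth:     {growth_factor:.1f}x")
print(f"Projected runs/day:  {projected_runs_per_day:.0f}")
```

Under these assumptions the averages come out to roughly 208 MB/s of ingest and under 2 run starts per minute today, with about 7,500 runs/day after the expected 3x pipeline growth.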
Requirements
- Design an automated CI/CD process for Databricks Jobs, Delta Live Tables pipelines, cluster policies, and Unity Catalog objects.
- Ensure deployments are idempotent, environment-specific, and rollback-safe.
- Define how pipeline code, infrastructure definitions, secrets, and configuration should be versioned and promoted.
- Include orchestration for dependency management, backfills, and scheduled releases.
- Add data quality gates before promoting pipeline changes to production.
- Specify monitoring for deployment health, pipeline SLA adherence, cost regressions, and failed runs.
- Explain how you would reduce manual ops work through standardization, reusable templates, and policy enforcement.
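As one possible illustration of the data quality gate requirement above, the sketch below decides whether a change may be promoted based on check results from a staging run. The `CheckResult` type, the `evaluate_gate` function, the check names, and the thresholds are all hypothetical and are not Databricks APIs; a real design might source these metrics from pipeline expectations or a separate validation job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one data quality check on a staging run (hypothetical shape)."""
    name: str              # illustrative check identifier
    observed: float        # metric measured on the staging run
    max_allowed: float     # threshold above which the check fails
    blocking: bool = True  # non-blocking checks only produce warnings


def evaluate_gate(results: list[CheckResult]) -> tuple[bool, list[str], list[str]]:
    """Return (promote, blocking_failures, warnings).

    Promotion is allowed only when no blocking check exceeds its threshold.
    """
    failed = [r for r in results if r.observed > r.max_allowed]
    blocking_failures = [r.name for r in failed if r.blocking]
    warnings = [r.name for r in failed if not r.blocking]
    return (not blocking_failures, blocking_failures, warnings)


# Illustrative values only; they are not taken from the requirements.
staging_results = [
    CheckResult("null_rate_account_id", observed=0.000, max_allowed=0.001),
    CheckResult("duplicate_rate_ledger", observed=0.020, max_allowed=0.005),
    CheckResult("row_count_drift", observed=0.300, max_allowed=0.200, blocking=False),
]

promote, blocking_failures, warnings = evaluate_gate(staging_results)
print("promote:", promote)
print("blocking failures:", blocking_failures)
print("warnings:", warnings)
```

Separating blocking failures from warnings lets a small team start with a few strict checks on tier-1 pipelines and tighten the rest over time without halting every release.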
Constraints
- Prefer native Databricks capabilities where possible
- Team has 5 data engineers and 2 platform engineers
- Monthly platform budget increase is capped at 15%
- Production changes require auditability and least-privilege access
- Some pipelines process SOX-relevant financial data and require approval gates
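Two of the constraints above, the 15% budget cap and the approval gate for SOX-relevant pipelines, lend themselves to automated pre-deployment checks. The sketch below is a minimal illustration only: `ProdChange`, `within_budget`, and `may_deploy` are hypothetical names, the cost figures are invented, and the rule that an approver must differ from the author is an assumed separation-of-duties policy rather than something the constraints state.

```python
from dataclasses import dataclass

BUDGET_INCREASE_CAP = 0.15  # monthly platform budget increase is capped at 15%


@dataclass(frozen=True)
class ProdChange:
    """A proposed production change (hypothetical shape)."""
    pipeline: str
    author: str
    sox_relevant: bool
    approved_by: tuple[str, ...] = ()  # reviewers who signed off


def within_budget(baseline_monthly_cost: float,
                  projected_monthly_cost: float,
                  cap: float = BUDGET_INCREASE_CAP) -> bool:
    """True if the projected relative cost increase stays within the cap."""
    if baseline_monthly_cost <= 0:
        raise ValueError("baseline_monthly_cost must be positive")
    increase = (projected_monthly_cost - baseline_monthly_cost) / baseline_monthly_cost
    return increase <= cap


def may_deploy(change: ProdChange) -> bool:
    """SOX-relevant changes need at least one approver other than the author.

    Assumption: non-SOX changes pass this particular gate without approval.
    """
    if not change.sox_relevant:
        return True
    return any(approver != change.author for approver in change.approved_by)


# Illustrative values only.
print(within_budget(100_000, 112_000))  # 12% increase
print(within_budget(100_000, 118_000))  # 18% increase

self_approved = ProdChange("finance_daily_close", "alice", True, ("alice",))
peer_approved = ProdChange("finance_daily_close", "alice", True, ("bob",))
print(may_deploy(self_approved), may_deploy(peer_approved))
```

Running checks like these in the release pipeline, and logging their inputs and results, is one way to support the auditability constraint without adding manual review steps for non-SOX changes.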