Context
A B2B SaaS company is standardizing its data platform on Databricks after years of running fragmented ETL on self-managed Spark, cron jobs, and ad hoc cloud scripts. The DevOps Engineer is responsible for designing the Databricks environment so batch and streaming pipelines can be deployed, governed, monitored, and operated consistently across development, staging, and production workspaces.
The platform must support both Lakeflow Declarative Pipelines and Databricks Jobs, with centralized workspace administration, Unity Catalog governance, cluster policy enforcement, and end-to-end resource monitoring. The goal is to reduce pipeline failures caused by inconsistent configuration, uncontrolled compute sprawl, and weak observability.
Scale Requirements
- Workspaces: 3 environments, 250 total users, 40 service principals
- Pipelines: 180 batch pipelines, 25 streaming pipelines
- Volume: 12 TB/day ingested, 2.5B records/day
- Latency: batch SLAs of 30-60 minutes; streaming freshness < 2 minutes
- Compute: 150+ concurrent job runs during peak business hours
- Retention: operational logs for 90 days, audit logs for 1 year
Requirements
- Design the Databricks workspace topology, account-level setup, networking, and environment isolation strategy.
- Define how pipelines are deployed using Databricks Asset Bundles, Databricks Repos, and CI/CD across dev/stage/prod.
- Specify workspace administration controls: SCIM provisioning, group-based RBAC, cluster policies, secret scopes, and Unity Catalog permissions.
- Design orchestration for batch and streaming pipelines using Databricks Jobs and Lakeflow Declarative Pipelines, including dependencies, retries, and backfills.
- Implement data quality controls for ingestion and transformation layers using Delta Lake constraints and pipeline expectations.
- Define monitoring for pipeline health, cluster utilization, job failures, cost, and SLA adherence using Databricks system tables and alerting integrations.
- Explain how you would handle incident response, failed runs, schema changes, and environment drift.
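The Asset Bundle deployment requirement can be sketched as the promotion step a CI runner would execute per environment. This is a minimal sketch, not a prescribed standard: the target names (`dev`, `stage`, `prod`) and the `promote` helper are assumptions, and it presumes the Databricks CLI is on PATH with auth supplied via environment variables.

```python
import subprocess

# Hypothetical promotion helper a CI job could call per environment.
# Assumes the Databricks CLI is installed and authenticated (e.g. via
# DATABRICKS_HOST / DATABRICKS_TOKEN environment variables).
TARGETS = ("dev", "stage", "prod")  # assumed bundle target names

def bundle_commands(target: str) -> list[list[str]]:
    """Return the CLI invocations that validate, then deploy, one bundle target."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]

def promote(target: str, dry_run: bool = True) -> None:
    """Print (or, when dry_run=False, run) the validate+deploy sequence."""
    for cmd in bundle_commands(target):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

promote("stage")  # dry run: prints the two CLI commands
```

Gating `prod` behind a manual approval in the CI system, while `dev` deploys on every merge, keeps the three environments consistent without extra tooling.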
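The cluster-policy requirement can be illustrated with the JSON definition a workspace admin would register through the Cluster Policies API or UI. A sketch under stated assumptions: the node types, tag value, and numeric limits below are illustrative, not recommendations.

```python
import json

# Sketch of a cluster policy definition for job compute.
# Attribute paths and constraint types (fixed / range / allowlist) follow
# the cluster-policy definition format; concrete values are assumptions.
policy = {
    "spark_version": {"type": "unlimited", "defaultValue": "auto:latest-lts"},
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}
print(json.dumps(policy, indent=2))
```

Capping workers and autotermination directly addresses the compute-sprawl and idle-cost concerns, and the fixed cost-center tag feeds cost attribution later.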
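The orchestration requirement (dependencies, retries) can be sketched as a Jobs API 2.1 payload for a two-task batch DAG. The job name, notebook paths, and cron schedule are hypothetical; the field names (`depends_on`, `max_retries`, `min_retry_interval_millis`) follow the Jobs API.

```python
import json

# Sketch of a Jobs API 2.1 job definition: ingest -> transform, with
# per-task retries and a nightly schedule. Paths and names are assumptions.
job_settings = {
    "name": "nightly-orders-batch",
    "max_concurrent_runs": 1,  # prevents a backfill overlapping the scheduled run
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/orders/ingest"},
            "max_retries": 2,
            "min_retry_interval_millis": 60_000,
            "timeout_seconds": 3600,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Pipelines/orders/transform"},
            "max_retries": 1,
        },
    ],
}
print(json.dumps(job_settings, indent=2))
```

Backfills can then reuse the same definition by triggering parameterized `run-now` calls for historical dates instead of maintaining a separate backfill job.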
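The data-quality requirement's warn/drop semantics can be sketched in plain Python to make the intended behavior concrete. In a Lakeflow Declarative Pipeline the same rules would be declared with expectation decorators such as `@dlt.expect_or_drop`; the field names and rules here are illustrative assumptions.

```python
# Pure-Python sketch of expectation semantics: rows failing any rule are
# dropped from the output and quarantined with the names of the failed rules.
RULES = {
    "valid_order_id": lambda r: r.get("order_id") is not None,
    "positive_amount": lambda r: (r.get("amount") or 0) > 0,
}

def apply_expectations(records):
    """Return (kept_rows, quarantined) where quarantined pairs each bad row
    with the list of rule names it failed."""
    kept, quarantined = [], []
    for r in records:
        failed = [name for name, rule in RULES.items() if not rule(r)]
        if failed:
            quarantined.append((r, failed))
        else:
            kept.append(r)
    return kept, quarantined

good, bad = apply_expectations([
    {"order_id": 1, "amount": 9.5},
    {"order_id": None, "amount": 3.0},  # fails valid_order_id
])
print(len(good), len(bad))  # → 1 1
```

The same conditions can be enforced at the storage layer with Delta CHECK constraints (`ALTER TABLE ... ADD CONSTRAINT ... CHECK (...)`), so bad rows are rejected even by writers that bypass the pipeline.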
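The monitoring requirement can be grounded in a query over the jobs system tables, for example a 24-hour failure-rate rollup that an alert could fire on. Table and column names follow the documented `system.lakeflow` schema but should be verified against the schemas enabled in the workspace; the 10% threshold is an assumption.

```python
# Sketch of an SLA/failure-rate query over Databricks system tables,
# suitable for a scheduled SQL alert. Threshold (0.1) is illustrative.
FAILURE_RATE_SQL = """
SELECT job_id,
       count_if(result_state = 'FAILED') AS failed_runs,
       count(*)                          AS total_runs,
       count_if(result_state = 'FAILED') / count(*) AS failure_rate
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= current_timestamp() - INTERVAL 24 HOURS
GROUP BY job_id
HAVING failure_rate > 0.1
"""
print(FAILURE_RATE_SQL.strip())
```

Running this as a Databricks SQL alert keeps the monitoring loop inside the platform, consistent with the constraint against extra tooling.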
Constraints
- Must run primarily on Databricks-native services; avoid introducing extra orchestration tools unless justified.
- Production data is governed under SOC 2 and GDPR requirements.
- Monthly platform budget is capped, so idle compute and duplicate environments must be minimized.
- The team has 3 platform engineers and 6 data engineers, so operational simplicity matters.
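The budget constraint above pairs naturally with a cost rollup from the billing system table, so idle or runaway compute is visible per job. A sketch, assuming the documented `system.billing.usage` columns; exact schema should be confirmed in the workspace before relying on it.

```python
# Sketch of a month-to-date DBU rollup per job from billing system tables,
# for attributing spend against the monthly platform budget cap.
COST_SQL = """
SELECT usage_metadata.job_id,
       sku_name,
       sum(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= date_trunc('MONTH', current_date())
GROUP BY usage_metadata.job_id, sku_name
ORDER BY dbus DESC
"""
print(COST_SQL.strip())
```

Joining on the `cost_center` tag enforced by the cluster policy would extend this to team-level chargeback without any external cost tooling.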