Context
A B2B SaaS company is standardizing its data platform on Databricks after years of running fragmented ETL on self-managed Spark, cron jobs, and ad hoc cloud scripts. The DevOps Engineer is responsible for designing the Databricks environment so batch and streaming pipelines can be deployed, governed, monitored, and operated consistently across development, staging, and production workspaces.
The platform must support both Lakeflow Declarative Pipelines and Databricks Jobs, with centralized workspace administration, Unity Catalog governance, cluster policy enforcement, and end-to-end resource monitoring. The goal is to reduce pipeline failures caused by inconsistent configuration, uncontrolled compute sprawl, and weak observability.
Scale Requirements
- Workspaces: 3 environments, 250 total users, 40 service principals
- Pipelines: 180 batch pipelines, 25 streaming pipelines
- Volume: 12 TB/day ingested, 2.5B records/day
- Latency: batch SLAs of 30-60 minutes; streaming freshness < 2 minutes
- Compute: 150+ concurrent job runs during peak business hours
- Retention: operational logs for 90 days, audit logs for 1 year
Requirements
- Design the Databricks workspace topology, account-level setup, networking, and environment isolation strategy.
- Define how pipelines are deployed using Databricks Asset Bundles, Databricks Repos, and CI/CD across dev/stage/prod.
- Specify workspace administration controls: SCIM provisioning, group-based RBAC, cluster policies, secret scopes, and Unity Catalog permissions.
- Design orchestration for batch and streaming pipelines using Databricks Jobs and Lakeflow Declarative Pipelines, including dependencies, retries, and backfills.
- Implement data quality controls for ingestion and transformation layers using Delta Lake constraints and pipeline expectations.
- Define monitoring for pipeline health, cluster utilization, job failures, cost, and SLA adherence using Databricks system tables and alerting integrations.
- Explain how you would handle incident response, failed runs, schema changes, and environment drift.
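The Asset Bundle deployment requirement can be sketched as the promotion step a CI runner would execute per environment. This is a minimal sketch, not a prescribed standard: the target names (`dev`, `stage`, `prod`) and the `promote` helper are assumptions, and it presumes the Databricks CLI is on PATH with auth supplied via environment variables.

```python
import subprocess

# Hypothetical promotion helper a CI job could call per environment.
# Assumes the Databricks CLI is installed and authenticated (e.g. via
# DATABRICKS_HOST / DATABRICKS_TOKEN environment variables).
TARGETS = ("dev", "stage", "prod")  # assumed bundle target names

def bundle_commands(target: str) -> list[list[str]]:
    """Return the CLI invocations that validate, then deploy, one bundle target."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]

def promote(target: str, dry_run: bool = True) -> None:
    """Print (or, when dry_run=False, run) the validate+deploy sequence."""
    for cmd in bundle_commands(target):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

promote("stage")  # dry run: prints the two CLI commands
```

Gating `prod` behind a manual approval in the CI system, while `dev` deploys on every merge, keeps the three environments consistent without extra tooling.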
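The cluster-policy requirement can be illustrated with the JSON definition a workspace admin would register through the Cluster Policies API or UI. A sketch under stated assumptions: the node types, tag value, and numeric limits below are illustrative, not recommendations.

```python
import json

# Sketch of a cluster policy definition for job compute.
# Attribute paths and constraint types (fixed / range / allowlist) follow
# the cluster-policy definition format; concrete values are assumptions.
policy = {
    "spark_version": {"type": "unlimited", "defaultValue": "auto:latest-lts"},
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}
print(json.dumps(policy, indent=2))
```

Capping workers and autotermination directly addresses the compute-sprawl and idle-cost concerns, and the fixed cost-center tag feeds cost attribution later.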
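The orchestration requirement (dependencies, retries) can be sketched as a Jobs API 2.1 payload for a two-task batch DAG. The job name, notebook paths, and cron schedule are hypothetical; the field names (`depends_on`, `max_retries`, `min_retry_interval_millis`) follow the Jobs API.

```python
import json

# Sketch of a Jobs API 2.1 job definition: ingest -> transform, with
# per-task retries and a nightly schedule. Paths and names are assumptions.
job_settings = {
    "name": "nightly-orders-batch",
    "max_concurrent_runs": 1,  # prevents a backfill overlapping the scheduled run
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/orders/ingest"},
            "max_retries": 2,
            "min_retry_interval_millis": 60_000,
            "timeout_seconds": 3600,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Pipelines/orders/transform"},
            "max_retries": 1,
        },
    ],
}
print(json.dumps(job_settings, indent=2))
```

Backfills can then reuse the same definition by triggering parameterized `run-now` calls for historical dates instead of maintaining a separate backfill job.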
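The data-quality requirement's warn/drop semantics can be sketched in plain Python to make the intended behavior concrete. In a Lakeflow Declarative Pipeline the same rules would be declared with expectation decorators such as `@dlt.expect_or_drop`; the field names and rules here are illustrative assumptions.

```python
# Pure-Python sketch of expectation semantics: rows failing any rule are
# dropped from the output and quarantined with the names of the failed rules.
RULES = {
    "valid_order_id": lambda r: r.get("order_id") is not None,
    "positive_amount": lambda r: (r.get("amount") or 0) > 0,
}

def apply_expectations(records):
    """Return (kept_rows, quarantined) where quarantined pairs each bad row
    with the list of rule names it failed."""
    kept, quarantined = [], []
    for r in records:
        failed = [name for name, rule in RULES.items() if not rule(r)]
        if failed:
            quarantined.append((r, failed))
        else:
            kept.append(r)
    return kept, quarantined

good, bad = apply_expectations([
    {"order_id": 1, "amount": 9.5},
    {"order_id": None, "amount": 3.0},  # fails valid_order_id
])
print(len(good), len(bad))  # → 1 1
```

The same conditions can be enforced at the storage layer with Delta CHECK constraints (`ALTER TABLE ... ADD CONSTRAINT ... CHECK (...)`), so bad rows are rejected even by writers that bypass the pipeline.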
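The monitoring requirement can be grounded in a query over the jobs system tables, for example a 24-hour failure-rate rollup that an alert could fire on. Table and column names follow the documented `system.lakeflow` schema but should be verified against the schemas enabled in the workspace; the 10% threshold is an assumption.

```python
# Sketch of an SLA/failure-rate query over Databricks system tables,
# suitable for a scheduled SQL alert. Threshold (0.1) is illustrative.
FAILURE_RATE_SQL = """
SELECT job_id,
       count_if(result_state = 'FAILED') AS failed_runs,
       count(*)                          AS total_runs,
       count_if(result_state = 'FAILED') / count(*) AS failure_rate
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= current_timestamp() - INTERVAL 24 HOURS
GROUP BY job_id
HAVING failure_rate > 0.1
"""
print(FAILURE_RATE_SQL.strip())
```

Running this as a Databricks SQL alert keeps the monitoring loop inside the platform, consistent with the constraint against extra tooling.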
Constraints
- Must run primarily on Databricks-native services; avoid introducing extra orchestration tools unless justified.
- Production data is governed under SOC 2 and GDPR requirements.
- Monthly platform budget is capped, so idle compute and duplicate environments must be minimized.
- The team has 3 platform engineers and 6 data engineers, so operational simplicity matters.
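The budget constraint above pairs naturally with a cost rollup from the billing system table, so idle or runaway compute is visible per job. A sketch, assuming the documented `system.billing.usage` columns; exact schema should be confirmed in the workspace before relying on it.

```python
# Sketch of a month-to-date DBU rollup per job from billing system tables,
# for attributing spend against the monthly platform budget cap.
COST_SQL = """
SELECT usage_metadata.job_id,
       sku_name,
       sum(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= date_trunc('MONTH', current_date())
GROUP BY usage_metadata.job_id, sku_name
ORDER BY dbus DESC
"""
print(COST_SQL.strip())
```

Joining on the `cost_center` tag enforced by the cluster policy would extend this to team-level chargeback without any external cost tooling.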