Context
Your team runs a production lakehouse pipeline on Databricks that ingests application logs and CDC data into Delta tables using Databricks Workflows, Auto Loader, and Delta Live Tables. Over the last two weeks, several jobs have missed their SLAs, but the Spark UI symptoms are inconsistent: some runs show executor under-utilization, while others stall during file discovery, checkpoint commits, or downstream table writes.
You are asked to design an operational debugging approach for a DevOps engineer supporting Databricks pipelines. The goal is not only to identify whether the bottleneck is CPU, memory, disk I/O, file-descriptor exhaustion, or the network, but also to connect Linux-level signals from tools such as top, strace, lsof, tcpdump, and iostat to specific Databricks pipeline stages and remediation actions.
Scale Requirements
- Input volume: 12 TB/day of JSON and Parquet landing in cloud object storage
- Streaming rate: 180K records/sec peak through Auto Loader
- Batch jobs: 40 scheduled Databricks Workflows per hour
- Latency SLA: Bronze to Silver < 8 minutes; Silver to Gold < 20 minutes
- Cluster size: 20-80 worker nodes, i3en/Storage Optimized class equivalent
- Retention: 30 days raw, 1 year curated Delta tables
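Before profiling anything, it helps to turn these scale numbers into per-worker budgets so that OS-level readings can be judged against expected load. A back-of-envelope sketch, assuming decimal TB and perfectly even distribution across workers (an idealization real shuffles rarely achieve):

```python
# Back-of-envelope throughput budget derived from the scale requirements.
# Assumes decimal TB and even work distribution across workers.

TB = 10**12

daily_bytes = 12 * TB
sustained_mb_s = daily_bytes / 86_400 / 10**6  # cluster-wide sustained ingest

peak_records = 180_000
for workers in (20, 80):
    per_worker = peak_records / workers
    print(f"{workers} workers -> {per_worker:,.0f} records/sec/worker at peak")

print(f"sustained ingest: {sustained_mb_s:.0f} MB/s cluster-wide")
```

At the 20-worker minimum this works out to roughly 9,000 records/sec/worker and about 139 MB/s of sustained ingest cluster-wide, which frames what "normal" looks like in iostat and network counters.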
Requirements
- Propose a step-by-step method to diagnose slow Databricks pipeline runs using top, strace, lsof, tcpdump, and iostat on driver and worker nodes.
- Explain how you would correlate OS-level findings with Databricks-side telemetry: cluster metrics (Ganglia on older runtimes), the Spark UI, Spark event logs, Delta transaction logs, and Databricks Workflows run history.
- Cover at least three failure modes: CPU saturation, storage I/O contention, and network-related delays to cloud storage or metastore services.
- Show how you would isolate whether the issue is in ingestion, shuffle, checkpointing, Delta commit, or downstream orchestration.
- Include monitoring, alerting, and rollback/runbook recommendations.
Constraints
- Must use Databricks-native services where possible
- No packet capture on regulated payload data beyond metadata-safe filters
- On-call team is small: 1 DevOps engineer and 2 data platform engineers
- Incremental cloud spend for observability must stay under $8K/month
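Given the small on-call team and the spend cap, the monitoring and alerting requirement can start with a lightweight SLA check over Workflows run history rather than a full observability stack. A hedged sketch: the `start_time`/`end_time` epoch-millisecond fields mirror the Jobs API 2.1 runs/list response, but the job names and SLA mapping here are hypothetical.

```python
# Sketch of an SLA-breach check over Databricks Workflows run history.
# Field names (start_time, end_time, epoch milliseconds) mirror the
# Jobs API 2.1 runs/list response; job names and SLA limits are examples.

SLA_MINUTES = {"bronze_to_silver": 8, "silver_to_gold": 20}

def sla_breaches(runs):
    """Return (job_name, duration_min) for runs exceeding their SLA."""
    breaches = []
    for run in runs:
        duration_min = (run["end_time"] - run["start_time"]) / 60_000
        limit = SLA_MINUTES.get(run["job_name"])
        if limit is not None and duration_min > limit:
            breaches.append((run["job_name"], round(duration_min, 1)))
    return breaches

runs = [
    {"job_name": "bronze_to_silver", "start_time": 0, "end_time": 11 * 60_000},
    {"job_name": "silver_to_gold", "start_time": 0, "end_time": 15 * 60_000},
]
print(sla_breaches(runs))  # [('bronze_to_silver', 11.0)]
```

Running a check like this on a schedule and alerting on breaches keeps the incremental observability cost close to zero, leaving budget headroom for deeper tooling only where the breach pattern points.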