Context
Your team runs a production lakehouse pipeline on Databricks that ingests application logs and CDC data into Delta tables using Databricks Workflows, Auto Loader, and Delta Live Tables. Over the last two weeks, several jobs have missed their SLAs, but the Spark UI symptoms are inconsistent: some runs show executor under-utilization, while others stall during file discovery, checkpoint commits, or downstream table writes.
You are asked to design an operational debugging approach for a DevOps engineer supporting Databricks pipelines. The goal is not only to identify whether the bottleneck is CPU, memory, disk I/O, file-descriptor exhaustion, or the network, but also to connect Linux-level signals from tools such as top, strace, lsof, tcpdump, and iostat to specific Databricks pipeline stages and remediation actions.
Scale Requirements
- Input volume: 12 TB/day of JSON and Parquet landing in cloud object storage
- Streaming rate: 180K records/sec peak through Auto Loader
- Batch jobs: 40 scheduled Databricks Workflows per hour
- Latency SLA: Bronze to Silver < 8 minutes; Silver to Gold < 20 minutes
- Cluster size: 20-80 worker nodes, i3en/Storage Optimized class equivalent
- Retention: 30 days raw, 1 year curated Delta tables
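Before profiling anything, it helps to turn these scale numbers into per-worker budgets so that OS-level readings can be judged against expected load. A back-of-envelope sketch, assuming decimal TB and perfectly even distribution across workers (an idealization real shuffles rarely achieve):

```python
# Back-of-envelope throughput budget derived from the scale requirements.
# Assumes decimal TB and even work distribution across workers.

TB = 10**12

daily_bytes = 12 * TB
sustained_mb_s = daily_bytes / 86_400 / 10**6  # cluster-wide sustained ingest

peak_records = 180_000
for workers in (20, 80):
    per_worker = peak_records / workers
    print(f"{workers} workers -> {per_worker:,.0f} records/sec/worker at peak")

print(f"sustained ingest: {sustained_mb_s:.0f} MB/s cluster-wide")
```

At the 20-worker minimum this works out to roughly 9,000 records/sec/worker and about 139 MB/s of sustained ingest cluster-wide, which frames what "normal" looks like in iostat and network counters.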
Requirements
- Propose a step-by-step method to diagnose slow Databricks pipeline runs using top, strace, lsof, tcpdump, and iostat on driver and worker nodes.
- Explain how you would correlate OS-level findings with Databricks-side telemetry: cluster metrics (Ganglia on older runtimes), the Spark UI, Spark event logs, Delta transaction logs, and Databricks Workflows run history.
- Cover at least three failure modes: CPU saturation, storage I/O contention, and network-related delays to cloud storage or metastore services.
- Show how you would isolate whether the issue is in ingestion, shuffle, checkpointing, Delta commit, or downstream orchestration.
- Include monitoring, alerting, and rollback/runbook recommendations.
Constraints
- Must use Databricks-native services where possible
- No packet capture on regulated payload data beyond metadata-safe filters
- On-call team is small: 1 DevOps engineer and 2 data platform engineers
- Incremental cloud spend for observability must stay under $8K/month
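Given the small on-call team and the spend cap, the monitoring and alerting requirement can start with a lightweight SLA check over Workflows run history rather than a full observability stack. A hedged sketch: the `start_time`/`end_time` epoch-millisecond fields mirror the Jobs API 2.1 runs/list response, but the job names and SLA mapping here are hypothetical.

```python
# Sketch of an SLA-breach check over Databricks Workflows run history.
# Field names (start_time, end_time, epoch milliseconds) mirror the
# Jobs API 2.1 runs/list response; job names and SLA limits are examples.

SLA_MINUTES = {"bronze_to_silver": 8, "silver_to_gold": 20}

def sla_breaches(runs):
    """Return (job_name, duration_min) for runs exceeding their SLA."""
    breaches = []
    for run in runs:
        duration_min = (run["end_time"] - run["start_time"]) / 60_000
        limit = SLA_MINUTES.get(run["job_name"])
        if limit is not None and duration_min > limit:
            breaches.append((run["job_name"], round(duration_min, 1)))
    return breaches

runs = [
    {"job_name": "bronze_to_silver", "start_time": 0, "end_time": 11 * 60_000},
    {"job_name": "silver_to_gold", "start_time": 0, "end_time": 15 * 60_000},
]
print(sla_breaches(runs))  # [('bronze_to_silver', 11.0)]
```

Running a check like this on a schedule and alerting on breaches keeps the incremental observability cost close to zero, leaving budget headroom for deeper tooling only where the breach pattern points.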