Context
A Databricks customer runs nightly ETL on the Databricks Lakehouse using Delta Lake tables and Databricks Workflows. Their current jobs process raw billing, product usage, and account metadata into curated fact tables, but pipeline runtime has grown from 40 minutes to nearly 3 hours. The team wants a clear explanation of Spark's DAG execution model and how it should influence pipeline design, debugging, and optimization on Databricks.
Your task is to explain Spark DAG execution in the context of a production Databricks pipeline, not as a generic theory question. Assume the pipeline reads 12 TB/day from Bronze Delta tables, joins 3 large datasets, performs aggregations, and writes 8 partitioned Silver/Gold Delta tables. The SLA is 90 minutes end-to-end, with individual critical tables available within 20 minutes of source arrival. Peak cluster size is 64 workers, and the platform must support weekly backfills of up to 180 days.
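To ground the discussion, here is a minimal sketch of the pipeline shape described above, assuming hypothetical Unity Catalog table and column names (`bronze.billing`, `account_id`, and so on). It marks where the DAG's stage boundaries will fall; it is not a production implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Narrow reads: each produces a scan stage over Bronze Delta files.
billing  = spark.read.table("bronze.billing")
usage    = spark.read.table("bronze.usage")
accounts = spark.read.table("bronze.accounts")

# Wide transformations: each join and groupBy introduces a shuffle,
# and therefore a stage boundary in the DAG.
enriched = (
    billing
    .join(usage, ["account_id", "usage_date"])
    .join(accounts, "account_id")
)

daily_revenue = (
    enriched
    .groupBy("account_id", "usage_date")
    .agg(F.sum("billed_amount").alias("revenue"))
)

# Nothing above has executed yet (lazy evaluation); this write is the
# action that triggers the whole DAG.
(daily_revenue.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("usage_date")
    .saveAsTable("silver.daily_revenue"))
```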
Requirements
- Explain how Spark converts DataFrame or SQL transformations into an analyzed logical plan, a Catalyst-optimized logical plan, a physical plan, and finally a DAG of stages and tasks on Databricks (see the plan-inspection sketch after this list).
- Distinguish narrow vs. wide transformations and describe how shuffles create stage boundaries (see the narrow/wide sketch below).
- Show how lazy evaluation affects ETL pipelines, especially when multiple actions or repeated reads are used in Databricks notebooks or Delta Live Tables-style patterns (see the lazy-evaluation sketch below).
- Describe how DAG execution shapes join strategy, partitioning, caching, skew handling, and write performance to Delta Lake (see the join-tuning sketch below).
- Explain how you would inspect and troubleshoot the DAG using the Spark UI, the Databricks query profile, event logs, and task-level metrics (see the job-labeling sketch below).
- Include concrete guidance for reducing runtime and avoiding unnecessary recomputation in a multi-step batch pipeline (the lazy-evaluation sketch below covers the recomputation case).
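For the plan layers, a quick way to surface each one, assuming the `daily_revenue` DataFrame from the sketch above; `explain()` is standard PySpark and both modes shown exist in Spark 3.x:

```python
# Parsed -> analyzed -> optimized logical plan -> physical plan, in one dump.
daily_revenue.explain(extended=True)

# Physical plan only, in the readable layout Databricks also surfaces in
# its SQL and query profile views.
daily_revenue.explain(mode="formatted")
```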
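A compact illustration of the narrow/wide distinction, again using the hypothetical `billing` DataFrame:

```python
from pyspark.sql import functions as F

# Narrow: each output partition depends on exactly one input partition,
# so filter and withColumn fuse into a single stage with the scan.
cleaned = (billing
    .filter(F.col("billed_amount") > 0)
    .withColumn("amount_usd", F.col("billed_amount") / 100))

# Wide: groupBy needs all rows for a key on one partition, so Spark inserts
# a shuffle here -- this exchange is the stage boundary visible in the DAG.
totals = cleaned.groupBy("account_id").agg(F.sum("amount_usd").alias("total_usd"))
```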
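The lazy-evaluation pitfall and the recomputation fix, in one sketch; the `product_id` column and the `silver.enriched_usage` table name are hypothetical:

```python
# Each action replays the full upstream DAG unless the result is materialized.
enriched.count()                               # action 1: scan + both joins
enriched.groupBy("product_id").count().show()  # action 2: scans and joins again

# Fix within one job: cache the reused intermediate.
enriched.cache()
enriched.count()   # first action after cache() populates it; later ones reuse it

# Fix across Workflows tasks: persist once to an approved Delta location and
# have downstream tasks read the table instead of rebuilding the DAG.
enriched.write.format("delta").mode("overwrite").saveAsTable("silver.enriched_usage")
```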
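For join strategy and skew, two DAG-aware levers, assuming the same DataFrames; the AQE settings shown are real Spark configurations and are on by default in recent Databricks runtimes:

```python
from pyspark.sql.functions import broadcast

# Broadcasting the small dimension turns a shuffle join into a broadcast
# hash join: one whole stage boundary disappears from the DAG.
enriched_fast = billing.join(broadcast(accounts), "account_id")

# Adaptive Query Execution can split skewed shuffle partitions at runtime;
# confirm it is enabled before relying on it for the large fact-fact joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```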
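When troubleshooting in the Spark UI, labeled jobs are far easier to trace back to pipeline steps; `setJobDescription` is a standard SparkContext API, and the label text here is illustrative:

```python
# Every job triggered after this call carries the label in the Spark UI's
# job list, making stage- and task-level metrics easy to attribute.
spark.sparkContext.setJobDescription("silver: billing x usage x accounts join")
enriched.count()
```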
Constraints
- Use Databricks-native terminology where possible: Delta Lake, Databricks Workflows, Unity Catalog, Spark UI, Photon.
- Data contains PII governed by Unity Catalog; intermediate outputs must remain in approved storage locations.
- Budget allows only a 20% increase in compute cost, so optimization should focus on DAG-aware design rather than brute-force scaling.
- Backfills must be idempotent and must not corrupt downstream Delta tables (a replaceWhere sketch follows this list).
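One pattern that satisfies the idempotency constraint is a partition-scoped overwrite with Delta Lake's `replaceWhere` option, sketched below with the hypothetical table and a hypothetical date parameter. Re-running a day replaces exactly that day's rows, so repeated backfills cannot duplicate or corrupt downstream data.

```python
from pyspark.sql import functions as F

backfill_date = "2024-01-15"   # hypothetical run parameter from Workflows

# Overwrite only the rows matching the predicate; the filter guarantees the
# written DataFrame contains nothing outside that predicate.
(daily_revenue
    .filter(F.col("usage_date") == backfill_date)
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"usage_date = '{backfill_date}'")
    .saveAsTable("silver.daily_revenue"))
```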