Context
A Databricks customer runs a nightly ETL pipeline on the Databricks Data Intelligence Platform to transform raw product, order, and clickstream data into Delta tables used by finance and growth analytics. The pipeline is implemented as a Databricks Job with Spark SQL and PySpark notebooks. Its runtime has regressed from 45 minutes to over 3 hours, causing SLA misses for downstream dashboards and model features.
You are asked to diagnose and redesign the pipeline for performance while preserving correctness and operational simplicity. Assume the pipeline reads from Bronze Delta tables, builds Silver business entities, and publishes Gold aggregates in Unity Catalog.
Scale Requirements
- Input volume: 12 TB/day across 3 Bronze Delta tables
- Daily rows processed: ~18 billion records
- Peak skew: top 0.5% of customer IDs account for 35% of events
- SLA: Gold tables available by 6:00 AM UTC
- Current runtime: 3+ hours; target is under 60 minutes
- Cluster: Databricks Jobs compute, 20 workers (i3.xlarge or equivalent), Photon enabled where possible
- Retention: 180 days in Bronze, 2 years in Gold
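To make the runtime target concrete, the figures above can be turned into rough throughput numbers. The sketch below is only back-of-the-envelope arithmetic. It assumes decimal units (1 TB = 10**12 bytes), 4 vCPUs per i3.xlarge-class worker, and perfectly even work distribution, none of which is stated in the requirements. It also does not say whether the 12 TB figure is compressed or uncompressed, so treat the results as order-of-magnitude guidance.

```python
# Back-of-the-envelope throughput implied by the stated scale requirements.
# Assumptions (not in the requirements): decimal units, 4 vCPUs per worker,
# and work spread perfectly evenly across cores.

INPUT_BYTES = 12 * 10**12      # 12 TB/day
ROWS = 18 * 10**9              # ~18 billion records/day
WORKERS = 20
CORES_PER_WORKER = 4           # assumption: i3.xlarge-class node
TARGET_SECONDS = 60 * 60       # target runtime: under 60 minutes

total_cores = WORKERS * CORES_PER_WORKER
bytes_per_row = INPUT_BYTES / ROWS
cluster_mb_per_s = INPUT_BYTES / TARGET_SECONDS / 10**6
worker_mb_per_s = cluster_mb_per_s / WORKERS
core_mb_per_s = cluster_mb_per_s / total_cores
rows_per_s = ROWS / TARGET_SECONDS

# Skew: top 0.5% of customer IDs produce 35% of events.
HOT_KEY_FRACTION = 0.005
HOT_EVENT_FRACTION = 0.35
# How many times more events an average hot key has than an average other key.
hot_to_cold_ratio = (HOT_EVENT_FRACTION / HOT_KEY_FRACTION) / (
    (1 - HOT_EVENT_FRACTION) / (1 - HOT_KEY_FRACTION)
)

print(f"average row size:   {bytes_per_row:,.0f} bytes")
print(f"cluster throughput: {cluster_mb_per_s:,.0f} MB/s")
print(f"per worker:         {worker_mb_per_s:,.0f} MB/s")
print(f"per core:           {core_mb_per_s:,.1f} MB/s")
print(f"rows per second:    {rows_per_s:,.0f}")
print(f"hot vs. other key:  {hot_to_cold_ratio:,.0f}x")
```

Under these assumptions the 60-minute target implies roughly 3.3 GB/s across the cluster, about 167 MB/s per worker and about 42 MB/s per core, at 5 million rows per second. An average hot customer ID carries about 107 times the events of an average remaining ID, which is why a join or aggregation keyed on customer ID is a natural first suspect for straggler tasks.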
Requirements
- Identify likely Spark bottlenecks across joins, shuffles, file layout, skew, and inefficient transformations.
- Propose an optimized Databricks pipeline design using Delta Lake, Databricks Workflows, and Unity Catalog.
- Explain how you would tune partitioning, caching, broadcast joins, Adaptive Query Execution, and file compaction.
- Define how to validate that optimizations do not change row counts, business logic, or idempotency.
- Include monitoring for stage runtime, shuffle spill, skewed tasks, small files, and Delta table health.
- Describe how you would handle backfills and rollback if a performance change degrades correctness.
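One way to approach the validation requirement is to compare a fingerprint of each output table before and after an optimization. The following is a minimal sketch in plain Python, not Spark code. The function name table_fingerprint and the sample rows are hypothetical, and rows are modeled as tuples. On the platform the same idea would be expressed as an aggregate over a per-row hash.

```python
import hashlib
from typing import Iterable, Tuple


def table_fingerprint(rows: Iterable[tuple]) -> Tuple[int, int]:
    """Return (row_count, checksum) for a collection of rows.

    The checksum is the sum of per-row hashes modulo 2**64, so it does not
    depend on row order (shuffles and repartitioning reorder rows) but, unlike
    XOR, it does change when a row is duplicated.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, checksum


# Hypothetical Gold rows: (customer_id, order_date, revenue_cents)
baseline = [("c1", "2024-01-01", 1250), ("c2", "2024-01-01", 900), ("c3", "2024-01-01", 0)]
reordered = list(reversed(baseline))             # same data, different order
changed = baseline[:2] + [("c3", "2024-01-01", 1)]  # one value differs
duplicated = baseline + [baseline[0]]            # a non-idempotent rerun

print(table_fingerprint(baseline) == table_fingerprint(reordered))   # same content
print(table_fingerprint(baseline) == table_fingerprint(changed))     # logic drift
print(table_fingerprint(baseline) == table_fingerprint(duplicated))  # duplicate rows
```

Running the old and new pipeline on the same input and comparing fingerprints covers row counts and business logic. Running the new pipeline twice on the same input and comparing fingerprints covers idempotency. A matching fingerprint is strong evidence rather than proof, so it is usually paired with targeted checks on key aggregates.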
Constraints
- Prefer native Databricks capabilities over external tools.
- Budget allows at most a 20% increase in compute spend.
- PII is governed in Unity Catalog; no unmanaged copies outside approved storage.
- The team wants notebook-to-production migration into reusable pipeline code within one quarter.
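The 20% budget constraint interacts with the runtime target, and it helps to work out how much cluster headroom it actually leaves. The sketch below assumes that spend scales linearly with worker-minutes, ignoring the driver, pricing differences between instance types, and any Photon or spot pricing effects. The helper max_workers is hypothetical. The constraint does not say which run is the spend baseline, so both the current 3-hour run and the original 45-minute run are shown.

```python
def max_workers(baseline_workers: int, baseline_minutes: int,
                target_minutes: int, increase_pct: int = 20) -> int:
    """Largest worker count that keeps worker-minutes within the budget.

    Assumption: cost is proportional to workers * minutes.
    """
    budget_worker_minutes_x100 = baseline_workers * baseline_minutes * (100 + increase_pct)
    return budget_worker_minutes_x100 // (100 * target_minutes)


# Baseline = current regressed run: 20 workers for 180 minutes.
vs_current = max_workers(20, 180, 60)

# Baseline = original healthy run: 20 workers for 45 minutes.
vs_original = max_workers(20, 45, 60)

print(f"workers affordable at 60 min vs. 3-hour baseline: {vs_current}")
print(f"workers affordable at 60 min vs. 45-minute baseline: {vs_original}")
```

Measured against the 3-hour run, a 60-minute job could use up to 72 workers and stay within budget. Measured against the original 45-minute run, it could use only 18, fewer than the current 20. Clarifying which baseline the budget refers to is therefore worth doing before proposing any cluster resize.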