Context
Databricks runs a Delta Lake ETL pipeline that ingests customer billing, workspace audit, and usage events into Bronze tables, then builds Silver and Gold aggregates for FinOps and internal reporting. The current Databricks Workflows job intermittently misses its SLA because a small number of high-volume workspace_id and account_id keys create severe skew during Spark joins and aggregations.
You are asked to redesign the pipeline so it remains reliable under skewed key distributions while preserving correctness and keeping operational complexity low.
Scale Requirements
- Input volume: 4.5 TB/day across 3 source domains
- Daily rows: ~18 billion in Bronze, ~2.2 billion in Silver
- Peak batch size: 750 GB per hourly load
- SLA: Hourly Silver tables available within 20 minutes of source arrival
- Concurrency: 12 downstream Gold models depend on Silver completion
- Retention: Bronze 90 days, Silver 1 year, Gold 3 years
Requirements
- Design a Databricks Lakehouse pipeline on Delta Lake and Databricks Workflows that detects and mitigates data skew in joins, aggregations, and MERGE operations.
- Explain how you would identify skew using the Spark UI, query plans, task metrics, and table statistics (see the skew-detection sketch after this list).
- Implement transformations for skewed joins between large fact tables and dimension/reference tables, including when to use broadcast joins, salting, repartitioning, or adaptive query execution (see the join-strategy sketch after this list).
- Ensure the pipeline is idempotent and supports backfills over a 30-day historical window (see the upsert sketch after this list).
- Define data quality checks for duplicate events, null business keys, and row-count drift before publishing Silver outputs (see the publish-gate sketch after this list).
- Describe how Delta tables should be partitioned or clustered to reduce shuffle pressure without creating too many small files (see the layout sketch after this list).
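Skew-detection sketch. One way to confirm hot keys before touching join logic is a key-frequency probe; the table and column names here (bronze.usage_events, workspace_id) are assumptions, not the real schema. The same skew shows up in the Spark UI as a stage whose max task duration and shuffle-read size dwarf the median.

```python
# Skew probe: per-key row counts on the join key, top keys first.
# bronze.usage_events and workspace_id are illustrative names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.table("bronze.usage_events")

# A handful of keys holding a large share of rows confirms the skew
# seen as straggler tasks in the Spark UI stage detail view.
key_counts = (
    events.groupBy("workspace_id")
    .count()
    .orderBy(F.desc("count"))
    .limit(50)
)
key_counts.show(truncate=False)

# Share of total rows held by the top keys; more than a few percent
# concentrated in one key is a red flag for shuffle-based joins.
total = events.count()
top_share = key_counts.agg(F.sum("count")).first()[0] / total
print(f"Top-50 keys hold {top_share:.1%} of rows")
```

The query plan (`df.explain()`) complements this: a SortMergeJoin on a key the probe flags as hot is the usual SLA offender.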
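Join-strategy sketch, assuming the same hypothetical tables. The ordering matters: let AQE skew-join splitting handle it first, force broadcasts for small reference tables, and reach for salting only on large-to-large joins where AQE is not enough. SALT_BUCKETS is an illustrative value to be tuned against the observed key frequencies.

```python
# Skewed-join mitigations, from cheapest to most invasive.
# Table names and SALT_BUCKETS are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1) Adaptive Query Execution: splits oversized shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

facts = spark.read.table("bronze.usage_events")
dims = spark.read.table("silver.workspace_dim")

# 2) Small dimension/reference tables: broadcast, eliminating the shuffle entirely.
joined_broadcast = facts.join(F.broadcast(dims), "workspace_id")

# 3) Large-to-large joins on hot keys: salt the fact side, replicate the
#    dimension side across all salt values, join on (key, salt).
SALT_BUCKETS = 16
salted_facts = facts.withColumn(
    "salt", (F.rand(seed=42) * SALT_BUCKETS).cast("int")
)
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)
joined_salted = salted_facts.join(
    salted_dims, ["workspace_id", "salt"]
).drop("salt")
```

Salting multiplies the dimension side by SALT_BUCKETS, so it only pays off when the dimension is far smaller than the fact table but too large to broadcast.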
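Upsert sketch for the idempotency and backfill requirement: deduplicate each batch deterministically, then MERGE on the business key so any hour can be replayed and converge to the same Silver state. The event_id/event_ts columns and table names are assumptions.

```python
# Idempotent hourly upsert into Silver; safe to re-run or backfill.
# event_id (business key), event_ts, and table names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def upsert_hour(load_start: str, load_end: str) -> None:
    batch = (
        spark.read.table("bronze.usage_events")
        .where(F.col("event_ts").between(load_start, load_end))
    )
    # Exactly one row per business key, latest event wins, so replays and
    # overlapping backfill windows produce identical Silver rows.
    latest = Window.partitionBy("event_id").orderBy(F.col("event_ts").desc())
    deduped = (
        batch.withColumn("rn", F.row_number().over(latest))
        .where("rn = 1")
        .drop("rn")
    )
    (
        DeltaTable.forName(spark, "silver.usage_events")
        .alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Backfill: drive the same function over each hour in the 30-day window
# from a Databricks Workflows task with parameterized start/end timestamps.
```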
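Publish-gate sketch for the three required checks; thresholds, column names, and the drift baseline are placeholders. If the pipeline moves to Delta Live Tables, DLT expectations would be the fully managed equivalent.

```python
# Data quality gate: abort before the Silver publish if duplicates,
# null business keys, or row-count drift exceed thresholds.
# event_id and drift_tolerance are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def check_batch(df, expected_rows: int, drift_tolerance: float = 0.25) -> None:
    total = df.count()
    dupes = total - df.dropDuplicates(["event_id"]).count()
    null_keys = df.where(F.col("event_id").isNull()).count()
    drift = abs(total - expected_rows) / max(expected_rows, 1)

    failures = []
    if dupes > 0:
        failures.append(f"{dupes} duplicate event_id rows")
    if null_keys > 0:
        failures.append(f"{null_keys} rows with null business key")
    if drift > drift_tolerance:
        failures.append(f"row count drifted {drift:.0%} from baseline")
    if failures:
        # Raising fails the Workflows task before the Silver publish step.
        raise ValueError("DQ gate failed: " + "; ".join(failures))

# expected_rows would come from a trailing average of recent hourly loads,
# e.g. the last 7 days of the same hour-of-day.
```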
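Layout sketch: keep Bronze partitioned by ingest date for retention pruning, and liquid-cluster Silver on the hot keys rather than partitioning by them, since high-cardinality, skewed partition keys are exactly what creates the small-file problem. Liquid clustering assumes a recent Databricks runtime; table and column names are assumptions.

```python
# Table layout: liquid clustering on join keys instead of skewed partitions.
# Schema and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.usage_events (
    event_id STRING,
    workspace_id STRING,
    account_id STRING,
    event_ts TIMESTAMP,
    event_date DATE
  )
  USING DELTA
  CLUSTER BY (workspace_id, event_date)  -- liquid clustering; no skewed partitions
""")

# Compact small files after each hourly load; clustering applies incrementally.
spark.sql("OPTIMIZE silver.usage_events")

# Keep file sizes healthy under frequent MERGEs.
spark.sql("""
  ALTER TABLE silver.usage_events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```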
Constraints
- Use Databricks-native services where possible: Delta Lake, Databricks Workflows, Unity Catalog, and Databricks SQL.
- Monthly incremental compute budget is capped at $40K.
- PII columns must remain governed under Unity Catalog with audited access (see the governance sketch after this list).
- The team prefers solutions that avoid custom infrastructure outside Databricks unless clearly justified.
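Governance sketch for the PII constraint, assuming a hypothetical finops catalog and a billing_email column: a Unity Catalog column mask plus an explicit grant, both of which are recorded in the UC audit log.

```python
# Unity Catalog governance: column mask on a PII column plus a group grant.
# finops catalog, billing_email, and group names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Masking function: only members of the privileged group see the raw value.
spark.sql("""
  CREATE OR REPLACE FUNCTION finops.masks.mask_email(email STRING)
  RETURNS STRING
  RETURN CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE '***redacted***'
  END
""")

spark.sql("""
  ALTER TABLE finops.silver.billing_accounts
  ALTER COLUMN billing_email SET MASK finops.masks.mask_email
""")

# Access stays auditable: grants and queries surface in UC audit logs.
spark.sql(
    "GRANT SELECT ON TABLE finops.silver.billing_accounts TO `finops_analysts`"
)
```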