Context
Databricks runs a Delta Lake ETL pipeline that ingests customer billing, workspace audit, and usage events into Bronze tables, then builds Silver and Gold aggregates for FinOps and internal reporting. The current Databricks Workflows job intermittently misses its SLA because a small number of high-volume workspace_id and account_id keys create severe skew during Spark joins and aggregations.
You are asked to redesign the pipeline so it remains reliable under skewed key distributions while preserving correctness and keeping operational complexity low.
Scale Requirements
- Input volume: 4.5 TB/day across 3 source domains
- Daily rows: ~18 billion in Bronze, ~2.2 billion in Silver
- Peak batch size: 750 GB per hourly load
- SLA: Hourly Silver tables available within 20 minutes of source arrival
- Concurrency: 12 downstream Gold models depend on Silver completion
- Retention: Bronze 90 days, Silver 1 year, Gold 3 years
Requirements
- Design a Databricks Lakehouse pipeline on Delta Lake and Databricks Workflows that detects and mitigates data skew in joins, aggregations, and MERGE operations.
- Explain how you would identify skew using the Spark UI, query plans, task metrics, and table statistics (see the skew-detection sketch after this list).
- Implement transformations for skewed joins between large fact tables and dimension/reference tables, including when to use broadcast joins, salting, repartitioning, or adaptive query execution (see the join-strategy sketch after this list).
- Ensure the pipeline is idempotent and supports backfills over a 30-day historical window (see the upsert sketch after this list).
- Define data quality checks for duplicate events, null business keys, and row-count drift before publishing Silver outputs (see the publish-gate sketch after this list).
- Describe how Delta tables should be partitioned or clustered to reduce shuffle pressure without creating too many small files (see the layout sketch after this list).
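Skew-detection sketch. One way to confirm hot keys before touching join logic is a key-frequency probe; the table and column names here (bronze.usage_events, workspace_id) are assumptions, not the real schema. The same skew shows up in the Spark UI as a stage whose max task duration and shuffle-read size dwarf the median.

```python
# Skew probe: per-key row counts on the join key, top keys first.
# bronze.usage_events and workspace_id are illustrative names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.table("bronze.usage_events")

# A handful of keys holding a large share of rows confirms the skew
# seen as straggler tasks in the Spark UI stage detail view.
key_counts = (
    events.groupBy("workspace_id")
    .count()
    .orderBy(F.desc("count"))
    .limit(50)
)
key_counts.show(truncate=False)

# Share of total rows held by the top keys; more than a few percent
# concentrated in one key is a red flag for shuffle-based joins.
total = events.count()
top_share = key_counts.agg(F.sum("count")).first()[0] / total
print(f"Top-50 keys hold {top_share:.1%} of rows")
```

The query plan (`df.explain()`) complements this: a SortMergeJoin on a key the probe flags as hot is the usual SLA offender.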
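Join-strategy sketch, assuming the same hypothetical tables. The ordering matters: let AQE skew-join splitting handle it first, force broadcasts for small reference tables, and reach for salting only on large-to-large joins where AQE is not enough. SALT_BUCKETS is an illustrative value to be tuned against the observed key frequencies.

```python
# Skewed-join mitigations, from cheapest to most invasive.
# Table names and SALT_BUCKETS are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1) Adaptive Query Execution: splits oversized shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

facts = spark.read.table("bronze.usage_events")
dims = spark.read.table("silver.workspace_dim")

# 2) Small dimension/reference tables: broadcast, eliminating the shuffle entirely.
joined_broadcast = facts.join(F.broadcast(dims), "workspace_id")

# 3) Large-to-large joins on hot keys: salt the fact side, replicate the
#    dimension side across all salt values, join on (key, salt).
SALT_BUCKETS = 16
salted_facts = facts.withColumn(
    "salt", (F.rand(seed=42) * SALT_BUCKETS).cast("int")
)
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)
joined_salted = salted_facts.join(
    salted_dims, ["workspace_id", "salt"]
).drop("salt")
```

Salting multiplies the dimension side by SALT_BUCKETS, so it only pays off when the dimension is far smaller than the fact table but too large to broadcast.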
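Upsert sketch for the idempotency and backfill requirement: deduplicate each batch deterministically, then MERGE on the business key so any hour can be replayed and converge to the same Silver state. The event_id/event_ts columns and table names are assumptions.

```python
# Idempotent hourly upsert into Silver; safe to re-run or backfill.
# event_id (business key), event_ts, and table names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def upsert_hour(load_start: str, load_end: str) -> None:
    batch = (
        spark.read.table("bronze.usage_events")
        .where(F.col("event_ts").between(load_start, load_end))
    )
    # Exactly one row per business key, latest event wins, so replays and
    # overlapping backfill windows produce identical Silver rows.
    latest = Window.partitionBy("event_id").orderBy(F.col("event_ts").desc())
    deduped = (
        batch.withColumn("rn", F.row_number().over(latest))
        .where("rn = 1")
        .drop("rn")
    )
    (
        DeltaTable.forName(spark, "silver.usage_events")
        .alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Backfill: drive the same function over each hour in the 30-day window
# from a Databricks Workflows task with parameterized start/end timestamps.
```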
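Publish-gate sketch for the three required checks; thresholds, column names, and the drift baseline are placeholders. If the pipeline moves to Delta Live Tables, DLT expectations would be the fully managed equivalent.

```python
# Data quality gate: abort before the Silver publish if duplicates,
# null business keys, or row-count drift exceed thresholds.
# event_id and drift_tolerance are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def check_batch(df, expected_rows: int, drift_tolerance: float = 0.25) -> None:
    total = df.count()
    dupes = total - df.dropDuplicates(["event_id"]).count()
    null_keys = df.where(F.col("event_id").isNull()).count()
    drift = abs(total - expected_rows) / max(expected_rows, 1)

    failures = []
    if dupes > 0:
        failures.append(f"{dupes} duplicate event_id rows")
    if null_keys > 0:
        failures.append(f"{null_keys} rows with null business key")
    if drift > drift_tolerance:
        failures.append(f"row count drifted {drift:.0%} from baseline")
    if failures:
        # Raising fails the Workflows task before the Silver publish step.
        raise ValueError("DQ gate failed: " + "; ".join(failures))

# expected_rows would come from a trailing average of recent hourly loads,
# e.g. the last 7 days of the same hour-of-day.
```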
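Layout sketch: keep Bronze partitioned by ingest date for retention pruning, and liquid-cluster Silver on the hot keys rather than partitioning by them, since high-cardinality, skewed partition keys are exactly what creates the small-file problem. Liquid clustering assumes a recent Databricks runtime; table and column names are assumptions.

```python
# Table layout: liquid clustering on join keys instead of skewed partitions.
# Schema and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.usage_events (
    event_id STRING,
    workspace_id STRING,
    account_id STRING,
    event_ts TIMESTAMP,
    event_date DATE
  )
  USING DELTA
  CLUSTER BY (workspace_id, event_date)  -- liquid clustering; no skewed partitions
""")

# Compact small files after each hourly load; clustering applies incrementally.
spark.sql("OPTIMIZE silver.usage_events")

# Keep file sizes healthy under frequent MERGEs.
spark.sql("""
  ALTER TABLE silver.usage_events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```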
Constraints
- Use Databricks-native services where possible: Delta Lake, Databricks Workflows, Unity Catalog, and Databricks SQL.
- Monthly incremental compute budget is capped at $40K.
- PII columns must remain governed under Unity Catalog with audited access (see the governance sketch after this list).
- The team prefers solutions that avoid custom infrastructure outside Databricks unless clearly justified.
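Governance sketch for the PII constraint, assuming a hypothetical finops catalog and a billing_email column: a Unity Catalog column mask plus an explicit grant, both of which are recorded in the UC audit log.

```python
# Unity Catalog governance: column mask on a PII column plus a group grant.
# finops catalog, billing_email, and group names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Masking function: only members of the privileged group see the raw value.
spark.sql("""
  CREATE OR REPLACE FUNCTION finops.masks.mask_email(email STRING)
  RETURNS STRING
  RETURN CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE '***redacted***'
  END
""")

spark.sql("""
  ALTER TABLE finops.silver.billing_accounts
  ALTER COLUMN billing_email SET MASK finops.masks.mask_email
""")

# Access stays auditable: grants and queries surface in UC audit logs.
spark.sql(
    "GRANT SELECT ON TABLE finops.silver.billing_accounts TO `finops_analysts`"
)
```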