Context
A SaaS company running on the Databricks Data Intelligence Platform currently ingests application events and CDC updates into Delta tables with hourly batch jobs. Product analytics, fraud detection, and operational dashboards need fresher data, so the team wants to standardize on Spark Structured Streaming with Delta Live Tables (now Lakeflow Declarative Pipelines) and Unity Catalog.
You are asked to design a production-grade streaming pipeline on Databricks, explaining end to end how Spark Structured Streaming should be used: ingestion, stateful processing, data quality, checkpointing, and downstream serving.
Scale Requirements
- Input sources: Kafka event stream + Debezium CDC landing in cloud object storage
- Throughput: 250K events/sec peak, 40K rows/sec CDC peak
- Event size: 1-3 KB JSON payloads
- Latency target: Bronze to Silver in < 90 seconds; Gold aggregates in < 3 minutes
- Daily volume: ~18 TB raw data/day
- Retention: 30 days raw replay, 1 year curated Delta tables
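A quick arithmetic sanity check on the numbers above helps size the design. The ~2 KB mean event size and the assumption that CDC rows are similar in size are illustrative choices, not part of the spec:

```python
# Sanity-check the stated scale figures (assumed ~2 KB mean record size).
KB = 1024
TB = 1024**4
SECONDS_PER_DAY = 86_400

peak_event_bps = 250_000 * 2 * KB   # Kafka events at peak, ~2 KB each
peak_cdc_bps = 40_000 * 2 * KB      # CDC rows at peak, ~2 KB each
peak_total_bps = peak_event_bps + peak_cdc_bps

# Upper bound: peak rate sustained for a full day.
peak_day_tb = peak_total_bps * SECONDS_PER_DAY / TB

# The stated ~18 TB/day implies an average rate well below peak.
avg_bps = 18 * TB / SECONDS_PER_DAY
avg_events_per_sec = avg_bps / (2 * KB)

print(f"peak ingest rate:       {peak_total_bps / (1024**2):.0f} MiB/s")
print(f"peak-sustained ceiling: {peak_day_tb:.1f} TB/day")
print(f"avg rate for 18 TB/day: {avg_bps / (1024**2):.0f} MiB/s "
      f"(~{avg_events_per_sec:,.0f} records/s at 2 KB)")
```

The gap between the ~47 TB/day peak ceiling and the ~18 TB/day average is the main argument for autoscaling compute rather than a cluster provisioned for peak, which aligns with the budget constraint below.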
Requirements
- Design a Databricks-native streaming architecture using Auto Loader, Structured Streaming, Delta Lake, and Unity Catalog.
- Explain how to achieve exactly-once (in practice, effectively-once) processing via checkpointing, idempotent writes, and deduplication.
- Handle late and out-of-order events using watermarks and stateful aggregations.
- Define Bronze, Silver, and Gold transformations, including schema evolution and CDC merge logic.
- Add data quality expectations, quarantine handling, and replay/backfill strategy.
- Describe orchestration with Databricks Workflows or Lakeflow, plus deployment across dev/staging/prod.
- Specify monitoring, alerting, and recovery for failed micro-batches, lag, and bad records.
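The CDC merge requirement can be illustrated without a cluster: the Silver table is an upsert-by-key view of the change stream, where each Debezium record carries an operation (`c`/`u`/`d`) and a monotonically increasing source position used to discard stale, out-of-order duplicates. A minimal in-memory sketch of the semantics a keyed Delta `MERGE` would implement (field names `op`, `key`, `seq`, `row` are illustrative, not a fixed schema):

```python
def apply_cdc(target: dict, changes: list[dict]) -> dict:
    """Merge Debezium-style change records into a keyed 'Silver' view.

    Mirrors the semantics of a Delta MERGE on the primary key:
    - keep only the newest change per key within the batch,
    - upsert on create/update ('c'/'u'), remove on delete ('d'),
    - skip changes no newer than what the target already holds, so
      replaying the same batch leaves the target unchanged.
    """
    # Latest change per key wins within the micro-batch.
    latest = {}
    for ch in changes:
        prev = latest.get(ch["key"])
        if prev is None or ch["seq"] > prev["seq"]:
            latest[ch["key"]] = ch

    merged = dict(target)
    for key, ch in latest.items():
        current = merged.get(key)
        if current is not None and current["seq"] >= ch["seq"]:
            continue  # stale or duplicate change: skip (idempotent replay)
        if ch["op"] == "d":
            merged.pop(key, None)
        else:
            merged[key] = {"seq": ch["seq"], "row": ch["row"]}
    return merged

batch = [
    {"key": 1, "seq": 10, "op": "c", "row": {"v": "a"}},
    {"key": 1, "seq": 12, "op": "u", "row": {"v": "b"}},
    {"key": 2, "seq": 11, "op": "c", "row": {"v": "x"}},
]
t1 = apply_cdc({}, batch)
t2 = apply_cdc(t1, batch)  # replaying the same batch changes nothing
```

In the real pipeline this logic runs inside `foreachBatch` (or a declarative `APPLY CHANGES` flow), with the source position coming from the Debezium LSN/offset rather than a synthetic `seq`.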
Constraints
- Must stay primarily within the Databricks ecosystem; minimize external operational overhead.
- PII data must be governed through Unity Catalog and support auditability.
- Budget allows autoscaling compute, but long-running overprovisioned clusters are discouraged.
- The design must support both continuous ingestion and historical backfills without duplicating records.
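The last constraint (continuous ingestion plus backfills with no duplicates) is typically met with idempotent writes: each logical writer stamps its batches with an application id and a monotonically increasing version, and the sink rejects any version it has already committed. This is the mechanism Delta Lake exposes via the `txnAppId`/`txnVersion` writer options; the toy in-memory model below sketches the commit protocol (class and method names are illustrative):

```python
class IdempotentSink:
    """Toy model of Delta's txnAppId/txnVersion idempotent-write protocol.

    Each logical writer (the streaming query, or a named backfill job)
    has its own app id; a batch commits only if its version is newer
    than the last version recorded for that app id, so retried or
    replayed batches become no-ops instead of duplicates.
    """

    def __init__(self):
        self.rows = []          # committed data
        self.last_version = {}  # app_id -> highest committed version

    def write(self, app_id: str, version: int, batch: list) -> bool:
        if version <= self.last_version.get(app_id, -1):
            return False        # already committed: retry is a no-op
        self.rows.extend(batch)
        self.last_version[app_id] = version
        return True

sink = IdempotentSink()
sink.write("stream", 0, ["e1", "e2"])
sink.write("stream", 0, ["e1", "e2"])       # retry after failure: ignored
sink.write("backfill-2024-06", 0, ["h1"])   # backfill under its own app id
sink.write("stream", 1, ["e3"])
```

Giving the continuous stream and each backfill run distinct app ids is what lets them write to the same Delta table concurrently without either one's retries duplicating records.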