Context
A SaaS company running on the Databricks Data Intelligence Platform currently ingests application events and CDC updates into Delta tables with hourly batch jobs. Product analytics, fraud detection, and operational dashboards need fresher data, so the team wants to standardize on Spark Structured Streaming with Delta Live Tables (now Lakeflow Declarative Pipelines) and Unity Catalog.
You are asked to design a production-grade streaming pipeline on Databricks, explaining end to end how Spark Structured Streaming should be used: ingestion, stateful processing, data quality, checkpointing, and downstream serving.
Scale Requirements
- Input sources: Kafka event stream + Debezium CDC landing in cloud object storage
- Throughput: 250K events/sec peak, 40K rows/sec CDC peak
- Event size: 1-3 KB JSON payloads
- Latency target: Bronze to Silver in < 90 seconds; Gold aggregates in < 3 minutes
- Daily volume: ~18 TB raw data/day
- Retention: 30 days raw replay, 1 year curated Delta tables
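A quick arithmetic sanity check on the numbers above helps size the design. The ~2 KB mean event size and the assumption that CDC rows are similar in size are illustrative choices, not part of the spec:

```python
# Sanity-check the stated scale figures (assumed ~2 KB mean record size).
KB = 1024
TB = 1024**4
SECONDS_PER_DAY = 86_400

peak_event_bps = 250_000 * 2 * KB   # Kafka events at peak, ~2 KB each
peak_cdc_bps = 40_000 * 2 * KB      # CDC rows at peak, ~2 KB each
peak_total_bps = peak_event_bps + peak_cdc_bps

# Upper bound: peak rate sustained for a full day.
peak_day_tb = peak_total_bps * SECONDS_PER_DAY / TB

# The stated ~18 TB/day implies an average rate well below peak.
avg_bps = 18 * TB / SECONDS_PER_DAY
avg_events_per_sec = avg_bps / (2 * KB)

print(f"peak ingest rate:       {peak_total_bps / (1024**2):.0f} MiB/s")
print(f"peak-sustained ceiling: {peak_day_tb:.1f} TB/day")
print(f"avg rate for 18 TB/day: {avg_bps / (1024**2):.0f} MiB/s "
      f"(~{avg_events_per_sec:,.0f} records/s at 2 KB)")
```

The gap between the ~47 TB/day peak ceiling and the ~18 TB/day average is the main argument for autoscaling compute rather than a cluster provisioned for peak, which aligns with the budget constraint below.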
Requirements
- Design a Databricks-native streaming architecture using Auto Loader, Structured Streaming, Delta Lake, and Unity Catalog.
- Explain how to achieve exactly-once (in practice, effectively-once) processing via checkpointing, idempotent writes, and deduplication.
- Handle late and out-of-order events using watermarks and stateful aggregations.
- Define Bronze, Silver, and Gold transformations, including schema evolution and CDC merge logic.
- Add data quality expectations, quarantine handling, and replay/backfill strategy.
- Describe orchestration with Databricks Workflows or Lakeflow, plus deployment across dev/staging/prod.
- Specify monitoring, alerting, and recovery for failed micro-batches, lag, and bad records.
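The CDC merge requirement can be illustrated without a cluster: the Silver table is an upsert-by-key view of the change stream, where each Debezium record carries an operation (`c`/`u`/`d`) and a monotonically increasing source position used to discard stale, out-of-order duplicates. A minimal in-memory sketch of the semantics a keyed Delta `MERGE` would implement (field names `op`, `key`, `seq`, `row` are illustrative, not a fixed schema):

```python
def apply_cdc(target: dict, changes: list[dict]) -> dict:
    """Merge Debezium-style change records into a keyed 'Silver' view.

    Mirrors the semantics of a Delta MERGE on the primary key:
    - keep only the newest change per key within the batch,
    - upsert on create/update ('c'/'u'), remove on delete ('d'),
    - skip changes no newer than what the target already holds, so
      replaying the same batch leaves the target unchanged.
    """
    # Latest change per key wins within the micro-batch.
    latest = {}
    for ch in changes:
        prev = latest.get(ch["key"])
        if prev is None or ch["seq"] > prev["seq"]:
            latest[ch["key"]] = ch

    merged = dict(target)
    for key, ch in latest.items():
        current = merged.get(key)
        if current is not None and current["seq"] >= ch["seq"]:
            continue  # stale or duplicate change: skip (idempotent replay)
        if ch["op"] == "d":
            merged.pop(key, None)
        else:
            merged[key] = {"seq": ch["seq"], "row": ch["row"]}
    return merged

batch = [
    {"key": 1, "seq": 10, "op": "c", "row": {"v": "a"}},
    {"key": 1, "seq": 12, "op": "u", "row": {"v": "b"}},
    {"key": 2, "seq": 11, "op": "c", "row": {"v": "x"}},
]
t1 = apply_cdc({}, batch)
t2 = apply_cdc(t1, batch)  # replaying the same batch changes nothing
```

In the real pipeline this logic runs inside `foreachBatch` (or a declarative `APPLY CHANGES` flow), with the source position coming from the Debezium LSN/offset rather than a synthetic `seq`.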
Constraints
- Must stay primarily within the Databricks ecosystem; minimize external operational overhead.
- PII data must be governed through Unity Catalog and support auditability.
- Budget allows autoscaling compute, but long-running overprovisioned clusters are discouraged.
- The design must support both continuous ingestion and historical backfills without duplicating records.
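The last constraint (continuous ingestion plus backfills with no duplicates) is typically met with idempotent writes: each logical writer stamps its batches with an application id and a monotonically increasing version, and the sink rejects any version it has already committed. This is the mechanism Delta Lake exposes via the `txnAppId`/`txnVersion` writer options; the toy in-memory model below sketches the commit protocol (class and method names are illustrative):

```python
class IdempotentSink:
    """Toy model of Delta's txnAppId/txnVersion idempotent-write protocol.

    Each logical writer (the streaming query, or a named backfill job)
    has its own app id; a batch commits only if its version is newer
    than the last version recorded for that app id, so retried or
    replayed batches become no-ops instead of duplicates.
    """

    def __init__(self):
        self.rows = []          # committed data
        self.last_version = {}  # app_id -> highest committed version

    def write(self, app_id: str, version: int, batch: list) -> bool:
        if version <= self.last_version.get(app_id, -1):
            return False        # already committed: retry is a no-op
        self.rows.extend(batch)
        self.last_version[app_id] = version
        return True

sink = IdempotentSink()
sink.write("stream", 0, ["e1", "e2"])
sink.write("stream", 0, ["e1", "e2"])       # retry after failure: ignored
sink.write("backfill-2024-06", 0, ["h1"])   # backfill under its own app id
sink.write("stream", 1, ["e3"])
```

Giving the continuous stream and each backfill run distinct app ids is what lets them write to the same Delta table concurrently without either one's retries duplicating records.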