Context
A Databricks customer runs ingestion and transformation pipelines for product telemetry, billing events, and account activity, landing data in AWS object storage and feeding downstream BI workloads. Their current data lake uses raw Parquet with custom compaction and schema-management jobs, which leads to unreliable upserts, difficult backfills, and inconsistent batch/stream semantics.
You need to redesign the platform using an open table format and pick one of Iceberg, Delta Lake, or Hudi. For this interview, assume you choose Delta Lake on Databricks and defend that choice in the context of production pipelines.
Scale Requirements
- Ingestion: 220K events/sec peak across Kafka and CDC sources
- Daily volume: 9-12 TB raw data/day
- Latency: Bronze to Silver under 3 minutes; Gold aggregates under 10 minutes
- Storage: 2 PB retained for 18 months
- Concurrency: 300 BI users, 40 scheduled jobs, 15 concurrent streaming pipelines
Requirements
- Design Bronze, Silver, and Gold pipelines using Databricks Auto Loader, Delta Live Tables (now Lakeflow Declarative Pipelines), and Unity Catalog (see the telemetry sketch after this list).
- Support both append-only telemetry streams and mutable CDC feeds with inserts, updates, and deletes (see the CDC sketch after this list).
- Explain why Delta Lake is the best fit versus Iceberg and Hudi for this Databricks-centered environment.
- Include strategies for schema evolution, deduplication, idempotent reprocessing, and backfills.
- Define how you would implement data quality expectations, lineage, and operational monitoring.
- Show how downstream consumers query curated tables in Databricks SQL with minimal maintenance overhead.
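To make these requirements concrete, here is a minimal Delta Live Tables / Lakeflow Declarative Pipelines sketch of the append-only telemetry path, combining Auto Loader ingestion, schema evolution, expectations, and watermark-based deduplication. The landing path, table names, and the `event_id` / `event_ts` columns are illustrative assumptions, not part of the scenario.

```python
import dlt
from pyspark.sql import functions as F

RAW_TELEMETRY_PATH = "s3://example-bucket/raw/telemetry/"  # hypothetical landing location

@dlt.table(name="bronze_telemetry", comment="Raw telemetry ingested with Auto Loader")
def bronze_telemetry():
    # Auto Loader discovers new files incrementally; schema evolution lets new
    # fields land in Bronze without manual DDL (DLT manages the schema location).
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load(RAW_TELEMETRY_PATH)
        .withColumn("_ingested_at", F.current_timestamp())
    )

@dlt.table(name="silver_telemetry", comment="Validated, deduplicated telemetry")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
@dlt.expect_or_drop("valid_event_ts", "event_ts IS NOT NULL")
def silver_telemetry():
    # Assumes event_ts is (or is cast to) a timestamp; the 72-hour watermark
    # bounds the deduplication state while tolerating late arrivals.
    return (
        dlt.read_stream("bronze_telemetry")
        .withWatermark("event_ts", "72 hours")
        .dropDuplicates(["event_id", "event_ts"])
    )
```

Expectation results are recorded in the pipeline event log, which, together with Unity Catalog lineage, is the natural starting point for the data-quality, lineage, and monitoring requirement.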
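For the mutable CDC feeds, the `dlt.apply_changes` API can replace hand-written MERGE logic for inserts, updates, and deletes. The source table name, key, sequencing column, and delete marker below are assumptions about the upstream feed, not givens from the scenario.

```python
import dlt
from pyspark.sql import functions as F

# Target Silver table that apply_changes keeps in sync with the CDC feed.
dlt.create_streaming_table("silver_accounts")

dlt.apply_changes(
    target="silver_accounts",
    source="bronze_accounts_cdc",             # hypothetical Bronze CDC table
    keys=["account_id"],                      # primary key of the upstream table
    sequence_by=F.col("change_seq"),          # ordering column so out-of-order changes resolve correctly
    apply_as_deletes=F.expr("op = 'DELETE'"),
    except_column_list=["op", "change_seq"],  # drop CDC bookkeeping columns from Silver
    stored_as_scd_type=1,                     # keep only the latest row per key
)
```

Switching `stored_as_scd_type` to 2 would retain full change history, which can also help with the auditability constraint below.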
Constraints
- Primary execution environment must be Databricks-managed services
- Team size: 5 data engineers, limited appetite for custom table-maintenance code
- Compliance: auditability and row-level governance required (a row-filter sketch follows this list)
- Budget target favors operational simplicity over multi-engine portability
- Pipelines must tolerate late-arriving data of up to 72 hours and recover from partial job failures without data loss (see the idempotent merge sketch after this list)
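Where a pipeline runs outside Delta Live Tables, the late-data and failure-recovery constraint can be handled with a checkpointed Structured Streaming job whose `foreachBatch` writer performs a keyed MERGE, so replaying a micro-batch after a partial failure, or re-running a 72-hour backfill window, converges to the same Silver state instead of duplicating rows. This is a sketch under assumed table and column names (`main.bronze.telemetry_events`, `event_id`, `event_ts`).

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def merge_telemetry_batch(batch_df, batch_id):
    # Collapse duplicates inside the micro-batch so the MERGE sees one row per key.
    latest = (
        batch_df.withColumn(
            "_rn",
            F.row_number().over(
                Window.partitionBy("event_id").orderBy(F.col("event_ts").desc())
            ),
        )
        .filter("_rn = 1")
        .drop("_rn")
    )
    target = DeltaTable.forName(spark, "main.silver.telemetry_events")  # hypothetical UC table
    (
        target.alias("t")
        .merge(latest.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()       # replays and late corrections update in place
        .whenNotMatchedInsertAll()    # genuinely new events are appended
        .execute()
    )

(
    spark.readStream.table("main.bronze.telemetry_events")
    .writeStream
    .foreachBatch(merge_telemetry_batch)
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/silver_telemetry")  # hypothetical path
    .trigger(availableNow=True)       # also usable as a one-off backfill or recovery run
    .start()
)
```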
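For the row-level governance constraint, Unity Catalog row filters bind a boolean SQL function to a table so that every Databricks SQL query is filtered per user or group. The governance schema, function, Gold table, and region logic below are illustrative assumptions.

```python
# Hypothetical objects: main.governance.billing_region_filter and main.gold.billing_daily
# stand in for whatever the real governance model defines.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.governance.billing_region_filter(account_region STRING)
    RETURN is_account_group_member('billing_admins') OR account_region = 'US'
""")

spark.sql("""
    ALTER TABLE main.gold.billing_daily
    SET ROW FILTER main.governance.billing_region_filter ON (account_region)
""")
```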