Choose Delta for Lakehouse Pipelines | Dataford Interview Questions

Context

A Databricks customer runs ingestion and transformation pipelines for product telemetry, billing events, and account activity across AWS object storage and downstream BI workloads. Their current data lake uses raw Parquet with custom compaction and schema-management jobs, causing unreliable upserts, difficult backfills, and inconsistent batch/stream semantics.

You need to redesign the platform using an open table format and pick one of Iceberg, Delta Lake, or Hudi. For this interview, assume you choose Delta Lake on Databricks and defend that choice in the context of production pipelines.

Scale Requirements

Ingestion: 220K events/sec peak across Kafka and CDC sources
Daily volume: 9-12 TB raw data/day
Latency: Bronze to Silver under 3 minutes; Gold aggregates under 10 minutes
Storage: 2 PB retained for 18 months
Concurrency: 300 BI users, 40 scheduled jobs, 15 concurrent streaming pipelines
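
A quick back-of-envelope check helps relate these figures to each other. The sketch below is an illustration only; it assumes decimal units (1 PB = 1000 TB), treats 18 months as 547.5 days, and reads the 2 PB figure as the total footprint for the retained data. None of these assumptions is stated in the prompt, so they are worth confirming with the interviewer.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 1.5 * 365   # assumption: 18 months taken as 547.5 days
RETAINED_PB = 2.0            # storage figure from the scale requirements
TB_PER_PB = 1000             # assumption: decimal units


def avg_mb_per_sec(tb_per_day: float) -> float:
    """Average ingest rate in MB/s if the daily volume were spread evenly."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def raw_pb_over_retention(tb_per_day: float) -> float:
    """Raw volume accumulated over the retention window, in PB."""
    return tb_per_day * RETENTION_DAYS / TB_PER_PB


def required_reduction(tb_per_day: float) -> float:
    """Factor by which raw data must shrink to fit the retained budget."""
    return raw_pb_over_retention(tb_per_day) / RETAINED_PB


for tb in (9, 12):
    print(
        f"{tb} TB/day: avg {avg_mb_per_sec(tb):.0f} MB/s, "
        f"raw {raw_pb_over_retention(tb):.2f} PB over retention, "
        f"needs {required_reduction(tb):.2f}x reduction"
    )
```

Under these assumptions the raw stream alone would accumulate to roughly 4.9 to 6.6 PB over 18 months, so a 2 PB budget implies about a 2.5x to 3.3x reduction through compression, compaction, or tiered retention. That gap is a useful point to raise when discussing file layout and retention policy.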

Requirements

Design Bronze, Silver, and Gold pipelines using Databricks Auto Loader, Delta Live Tables (since renamed Lakeflow Declarative Pipelines), and Unity Catalog.
Support both append-only telemetry streams and mutable CDC feeds with inserts, updates, and deletes.
Explain why Delta Lake is the best fit versus Iceberg and Hudi for this Databricks-centered environment.
Include strategies for schema evolution, deduplication, idempotent reprocessing, and backfills.
Define how you would implement data quality expectations, lineage, and operational monitoring.
Show how downstream consumers query curated tables in Databricks SQL with minimal maintenance overhead.
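
The CDC, deduplication, and idempotent-reprocessing requirements can be reasoned about with a small model before choosing any platform feature. The following is a pure-Python sketch of the semantics only, not Delta Lake or Databricks API code. The names `ChangeEvent`, `Row`, `apply_cdc`, and `live_rows` are hypothetical, and the sketch assumes every change carries a per-key sequence number (for example a log offset) that orders it.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class ChangeEvent(NamedTuple):
    key: str                  # primary key of the source row
    seq: int                  # per-key ordering value, e.g. a log offset
    op: str                   # "insert", "update", or "delete"
    payload: Optional[dict]   # row contents; None for deletes


class Row(NamedTuple):
    seq: int
    deleted: bool
    payload: Optional[dict]


def apply_cdc(target: Dict[str, Row], batch: Iterable[ChangeEvent]) -> Dict[str, Row]:
    """Return a new target state after applying a batch of change events."""
    result = dict(target)
    for ev in batch:
        current = result.get(ev.key)
        # Skip duplicates and out-of-order events: only a strictly newer
        # sequence number may overwrite what is already stored.
        if current is not None and ev.seq <= current.seq:
            continue
        if ev.op == "delete":
            # Keep a tombstone so a late, older update cannot resurrect the row.
            result[ev.key] = Row(ev.seq, True, None)
        else:
            result[ev.key] = Row(ev.seq, False, ev.payload)
    return result


def live_rows(target: Dict[str, Row]) -> Dict[str, dict]:
    """Rows visible to consumers, with tombstones filtered out."""
    return {k: r.payload for k, r in target.items() if not r.deleted}


batch = [
    ChangeEvent("a", 1, "insert", {"v": 1}),
    ChangeEvent("a", 2, "update", {"v": 2}),
    ChangeEvent("b", 1, "insert", {"v": 9}),
    ChangeEvent("b", 3, "delete", None),
    ChangeEvent("b", 2, "update", {"v": 10}),  # arrives after the delete
    ChangeEvent("a", 2, "update", {"v": 2}),   # exact duplicate
]
state = apply_cdc({}, batch)
replayed = apply_cdc(state, batch)  # reprocessing the same batch
print(live_rows(state), state == replayed)
```

Because only a strictly newer sequence number can overwrite a key, replaying a batch after a partial failure leaves the target unchanged, and duplicates or out-of-order arrivals within a batch do not affect the result. A candidate would then map this behaviour onto whatever merge or change-application mechanism they propose for the Silver layer.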

Constraints

Primary execution environment must be Databricks-managed services
Team size: 5 data engineers, limited appetite for custom table-maintenance code
Compliance: auditability and row-level governance required
Budget target favors operational simplicity over multi-engine portability
Pipelines must tolerate late-arriving data up to 72 hours and recover from partial job failures without data loss
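
The 72-hour late-data constraint is commonly expressed as an event-time watermark. The sketch below shows the idea in plain Python, not the behaviour of any specific streaming engine; `ALLOWED_LATENESS`, `watermark`, and `is_within_lateness` are hypothetical names, and treating an event exactly at the boundary as on time is an assumption.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=72)  # from the constraint above


def watermark(max_event_time: datetime) -> datetime:
    """Oldest event time still accepted, given the newest event time seen."""
    return max_event_time - ALLOWED_LATENESS


def is_within_lateness(event_time: datetime, max_event_time: datetime) -> bool:
    """True if the event may still update results; False if it is too late."""
    return event_time >= watermark(max_event_time)


newest = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
on_time = newest - timedelta(hours=71)
too_late = newest - timedelta(hours=73)
print(is_within_lateness(on_time, newest), is_within_lateness(too_late, newest))
```

One consequence worth discussing is that any deduplication or aggregation state has to be retained for at least the same 72-hour window, which affects state size and checkpoint cost at the stated ingest rates.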
