Context
Databricks runs internal platform telemetry pipelines that ingest workspace audit logs, cluster events, and job run metadata for operational analytics and customer-facing usage reporting. The current design relies on hourly batch ingestion into Delta Lake, but product and support teams now need near-real-time visibility, and the pipeline must preserve correctness during regional network partitions and downstream outages.
You are asked to design a Databricks-native pipeline and explain how CAP theorem trade-offs influence your choices. Assume some upstream systems are distributed across regions and may become partitioned; your pipeline must make explicit decisions about when to favor availability versus consistency.
Scale Requirements
- Sources: audit logs, job events, billing usage, cluster metrics
- Throughput: 250K events/sec at peak across regions, 40K events/sec sustained average
- Event size: 1-4 KB JSON
- Daily volume: ~12 TB raw, ~4 PB retained over 13 months
- Freshness targets: Bronze < 60 seconds, Silver < 5 minutes, Gold < 15 minutes
- Query SLA: 99.9% of dashboard queries under 3 seconds on Gold tables
Requirements
- Design an end-to-end pipeline on Databricks using Auto Loader, Delta Lake, and Databricks Workflows for ingestion, transformation, and orchestration; a Bronze ingestion sketch follows this list.
- Show how the system handles network partitions between regions, delayed source delivery, and duplicate event replay (the idempotent MERGE sketch below covers replay and late arrivals).
- Define where the design prioritizes consistency (for example, billing-grade aggregates) versus availability (for example, operational monitoring dashboards).
- Implement idempotent ingestion, schema evolution handling, and replay/backfill support for 30 days of historical data; see the MERGE sketch after this list.
- Include data quality controls for malformed JSON, null primary keys, out-of-order events, and source drift; a quarantine-gate sketch follows.
- Describe monitoring, alerting, and recovery procedures for streaming lag, failed checkpoints, and Delta write conflicts; a listener-based lag monitor is sketched below.
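As a starting point for the first requirement, here is a minimal Bronze ingestion sketch using Auto Loader (`cloudFiles`). The bucket paths, table name, and trigger interval are illustrative assumptions sized against the < 60 s Bronze target, not prescribed by the brief; `spark` is the ambient session in a Databricks notebook.

```python
from pyspark.sql import functions as F

RAW_PATH = "s3://telemetry-landing/audit-logs/"               # hypothetical landing bucket
CHECKPOINT = "s3://telemetry-checkpoints/bronze/audit-logs/"  # hypothetical checkpoint path

bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks the evolving schema here and adds new columns as sources drift.
    .option("cloudFiles.schemaLocation", CHECKPOINT)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.inferColumnTypes", "true")
    .load(RAW_PATH)
    # Lineage columns support replay, SOC 2 audit, and quarantine triage downstream.
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

(bronze.writeStream
    .option("checkpointLocation", CHECKPOINT)
    # A 30 s trigger leaves headroom under the < 60 s Bronze freshness target.
    .trigger(processingTime="30 seconds")
    .toTable("telemetry.bronze_audit_logs"))
```

Fields that fail to parse land in Auto Loader's `_rescued_data` column rather than failing the stream, which the quality gate sketched further below relies on.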
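For duplicate replay and late delivery, one plausible pattern is `foreachBatch` plus a Delta MERGE keyed on a unique event ID, which makes reprocessing idempotent: a replayed event matches an existing key and is ignored, while a late event is still inserted whenever it arrives. The `event_id` column and table names are assumptions of the sketch.

```python
from delta.tables import DeltaTable

def upsert_events(batch_df, batch_id):
    # MERGE requires a unique key on the source side, so dedupe within the batch first.
    deduped = batch_df.dropDuplicates(["event_id"])
    (DeltaTable.forName(spark, "telemetry.silver_events").alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        # A matched key is a replayed duplicate: do nothing, insert only new events.
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("telemetry.bronze_audit_logs")
    .writeStream
    .foreachBatch(upsert_events)
    .option("checkpointLocation", "s3://telemetry-checkpoints/silver/events/")
    .trigger(processingTime="1 minute")
    .start())
```

Because the MERGE keys on `event_id` rather than arrival time, a 30-day backfill can be replayed through the same path without creating duplicates.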
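For the quality controls, a sketch of a quarantine gate between Bronze and Silver: records with a null primary key or rescued (malformed) JSON are routed to a side table instead of failing the stream. Table names and the `event_id` key are assumptions; Delta CHECK constraints or DLT expectations could enforce the same rules declaratively.

```python
from pyspark.sql import functions as F

def quality_gate(batch_df, batch_id):
    # Auto Loader parks fields it could not parse in _rescued_data.
    is_bad = F.col("event_id").isNull() | F.col("_rescued_data").isNotNull()
    batch_df.where(is_bad).write.mode("append") \
        .saveAsTable("telemetry.quarantine_events")
    batch_df.where(~is_bad).write.mode("append") \
        .saveAsTable("telemetry.silver_events_staging")

(spark.readStream.table("telemetry.bronze_audit_logs")
    .writeStream
    .foreachBatch(quality_gate)
    .option("checkpointLocation", "s3://telemetry-checkpoints/dq/audit-logs/")
    .start())
```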
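For monitoring, one way to surface streaming lag and failed queries is a `StreamingQueryListener` (available in PySpark 3.4+). The `notify_oncall` hook is a placeholder stub, not a real API, and the threshold is taken from the Bronze freshness target.

```python
from pyspark.sql.streaming import StreamingQueryListener

BRONZE_SLA_MS = 60_000  # < 60 s Bronze freshness target

def notify_oncall(message: str) -> None:
    # Placeholder: wire to PagerDuty/Slack in a real deployment.
    print(f"ALERT: {message}")

class LagListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        p = event.progress
        # triggerExecution spans a whole micro-batch; values approaching the
        # SLA mean the stream is not keeping up with arrivals.
        if p.durationMs.get("triggerExecution", 0) > BRONZE_SLA_MS:
            notify_oncall(f"{p.name}: micro-batch exceeded {BRONZE_SLA_MS} ms")

    def onQueryTerminated(self, event):
        # Exceptional termination often indicates a corrupt checkpoint or a
        # Delta write conflict; page rather than silently restarting.
        if event.exception:
            notify_oncall(f"query {event.id} terminated: {event.exception}")

spark.streams.addListener(LagListener())
```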
Constraints
- Must run primarily on the Databricks Data Intelligence Platform on AWS.
- Prefer Delta Lake tables over external warehouse-first designs.
- Budget target: incremental platform cost under $35K/month.
- Compliance: SOC 2 auditability and deletion support within 7 days for selected customer metadata.
- Engineering team: 5 data engineers, 1 platform engineer; solution should minimize operational overhead.