Context
Databricks operates large-scale platform services that emit infrastructure, audit, application, and query execution logs from multiple regions. The current mix of regional collectors and hourly batch compaction is no longer sufficient for security analytics, customer-facing observability, and incident response, which now require near-real-time processing on the Databricks Lakehouse.
You are asked to design a Databricks-native pipeline to ingest, validate, enrich, and serve petabytes of log data in real time using Delta Lake and managed Databricks services.
Scale Requirements
- Ingress throughput: 15-25 million log events/second peak globally
- Event size: 0.8-2.5 KB compressed JSON or protobuf
- Daily volume: 1.5-3 PB/day raw logs
- Latency target: P95 < 90 seconds from event emission to queryable rows in curated tables
- Retention: 30 days hot, 1 year cold archive
- Availability target: 99.95% pipeline uptime across regions
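A quick back-of-envelope check shows the scale numbers above are internally consistent. The midpoint event size (~1.65 KB) and an assumed sustained-to-peak ratio of ~60% are assumptions introduced here, not figures from the brief:

```python
# Sanity check of the stated scale numbers.
# Assumptions (not from the brief): events average ~1.65 KB and
# sustained throughput runs at ~60% of the stated peak.

PEAK_EVENTS_PER_SEC = 20e6   # midpoint of 15-25M events/s
AVG_EVENT_BYTES = 1.65e3     # midpoint of 0.8-2.5 KB
SUSTAINED_FRACTION = 0.6     # assumed average-to-peak ratio
SECONDS_PER_DAY = 86_400

peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES
daily_bytes = peak_bytes_per_sec * SUSTAINED_FRACTION * SECONDS_PER_DAY
hot_tier_bytes = daily_bytes * 30  # 30-day hot retention

print(f"peak ingress:  {peak_bytes_per_sec / 1e9:.1f} GB/s")   # 33.0 GB/s
print(f"daily volume:  {daily_bytes / 1e15:.2f} PB/day")       # 1.71 PB/day
print(f"hot tier size: {hot_tier_bytes / 1e15:.0f} PB")        # 51 PB
```

Under these assumptions the pipeline lands in the middle of the stated 1.5-3 PB/day range, and the 30-day hot tier alone holds roughly 45-90 PB, which drives the partitioning and compaction decisions below.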
Requirements
- Design a streaming ingestion layer that can absorb multi-region bursts without data loss.
- Use Databricks Auto Loader, Delta Lake, and Structured Streaming to land raw logs and build curated tables.
- Support schema evolution, deduplication, late-arriving events, and replay/backfill for up to 14 days.
- Implement data quality controls for malformed payloads, missing required fields, duplicate event IDs, and clock skew.
- Expose outputs for both operational dashboards and downstream security/compliance analytics.
- Define orchestration using Databricks Workflows for streaming jobs, compaction, backfills, and recovery tasks.
- Explain partitioning, file sizing, checkpointing, and state management decisions at this scale.
- Describe monitoring, alerting, and on-call response for lag, bad records, cost spikes, and regional failures.
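The quality controls above can be made concrete with a small, Spark-free sketch of the row-level rules a Structured Streaming job would enforce before writing to curated tables. The field names (`event_id`, `tenant_id`, `emitted_at`, `payload`) and the 10-minute skew window are illustrative assumptions, not part of the brief:

```python
import json

# Illustrative sketch of row-level quality rules: malformed payloads,
# missing required fields, duplicate event IDs, and clock skew.
# Field names and the skew threshold are assumptions, not spec.

REQUIRED_FIELDS = {"event_id", "tenant_id", "emitted_at", "payload"}
MAX_CLOCK_SKEW_SEC = 600  # reject timestamps >10 min in the future

def classify(raw: str, seen_ids: set, now: float):
    """Return ("ok" | rejection reason, parsed event or None)."""
    try:
        event = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "malformed", None
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return f"missing:{','.join(sorted(missing))}", None
    if event["event_id"] in seen_ids:
        return "duplicate", None
    if event["emitted_at"] > now + MAX_CLOCK_SKEW_SEC:
        return "clock_skew", None
    seen_ids.add(event["event_id"])
    return "ok", event
```

In the actual pipeline, the dedup set would not live in memory: duplicate suppression and late-event handling belong in Structured Streaming's state store, bounded by a watermark so state does not grow without limit, with rejected rows routed to a quarantine Delta table rather than dropped.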
Constraints
- The primary platform should remain within the Databricks ecosystem where possible.
- Logs may contain sensitive identifiers; encryption, access control, and auditability are required.
- Cross-region network egress should be minimized.
- The design must support tenant isolation and selective deletion for compliance workloads.
- Assume a team of 6-8 engineers and a hard requirement to avoid long operational runbooks.
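The tenant-isolation and selective-deletion constraints interact with the physical layout: if curated tables are partitioned by date and tenant (a layout assumed here, not mandated by the brief), a compliance deletion for one tenant reduces to rewriting a bounded set of partitions, e.g. via Delta's DELETE followed by VACUUM, instead of scanning 30 days of all tenants' data. A minimal sketch of that partition planning:

```python
from datetime import date, timedelta

# Hypothetical partition layout: date=YYYY-MM-DD/tenant=<id>.
# A selective delete for one tenant only needs to touch the
# partitions in the requested date range.

def partitions_to_rewrite(tenant: str, start: date, end: date):
    """List the partition directories a selective delete must touch."""
    days = (end - start).days + 1
    return [
        f"date={start + timedelta(days=i)}/tenant={tenant}"
        for i in range(days)
    ]
```

Whether tenant belongs in the partition key at this cardinality is itself a design decision the candidate should defend (high-cardinality tenant partitioning can produce small files, which conflicts with the file-sizing requirement above).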