Context
Databricks operates large-scale platform services that emit infrastructure, audit, application, and query execution logs from multiple regions. The current mix of regional collectors and hourly batch compaction is no longer sufficient for security analytics, customer-facing observability, and incident response, which now require near-real-time processing on the Databricks Lakehouse.
You are asked to design a Databricks-native pipeline to ingest, validate, enrich, and serve petabytes of log data in real time using Delta Lake and managed Databricks services.
Scale Requirements
- Ingress throughput: 15-25 million log events/second peak globally
- Event size: 0.8-2.5 KB compressed JSON or protobuf
- Daily volume: 1.5-3 PB/day raw logs
- Latency target: P95 < 90 seconds from event emission to queryable rows in curated tables
- Retention: 30 days hot, 1 year cold archive
- Availability target: 99.95% pipeline uptime across regions
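A quick back-of-envelope check shows the scale numbers above are internally consistent. The midpoint event size (~1.65 KB) and an assumed sustained-to-peak ratio of ~60% are assumptions introduced here, not figures from the brief:

```python
# Sanity check of the stated scale numbers.
# Assumptions (not from the brief): events average ~1.65 KB and
# sustained throughput runs at ~60% of the stated peak.

PEAK_EVENTS_PER_SEC = 20e6   # midpoint of 15-25M events/s
AVG_EVENT_BYTES = 1.65e3     # midpoint of 0.8-2.5 KB
SUSTAINED_FRACTION = 0.6     # assumed average-to-peak ratio
SECONDS_PER_DAY = 86_400

peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES
daily_bytes = peak_bytes_per_sec * SUSTAINED_FRACTION * SECONDS_PER_DAY
hot_tier_bytes = daily_bytes * 30  # 30-day hot retention

print(f"peak ingress:  {peak_bytes_per_sec / 1e9:.1f} GB/s")   # 33.0 GB/s
print(f"daily volume:  {daily_bytes / 1e15:.2f} PB/day")       # 1.71 PB/day
print(f"hot tier size: {hot_tier_bytes / 1e15:.0f} PB")        # 51 PB
```

Under these assumptions the pipeline lands in the middle of the stated 1.5-3 PB/day range, and the 30-day hot tier alone holds roughly 45-90 PB, which drives the partitioning and compaction decisions below.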
Requirements
- Design a streaming ingestion layer that can absorb multi-region bursts without data loss.
- Use Databricks Auto Loader, Delta Lake, and Structured Streaming to land raw logs and build curated tables.
- Support schema evolution, deduplication, late-arriving events, and replay/backfill for up to 14 days.
- Implement data quality controls for malformed payloads, missing required fields, duplicate event IDs, and clock skew.
- Expose outputs for both operational dashboards and downstream security/compliance analytics.
- Define orchestration using Databricks Workflows for streaming jobs, compaction, backfills, and recovery tasks.
- Explain partitioning, file sizing, checkpointing, and state management decisions at this scale.
- Describe monitoring, alerting, and on-call response for lag, bad records, cost spikes, and regional failures.
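The quality controls above can be made concrete with a small, Spark-free sketch of the row-level rules a Structured Streaming job would enforce before writing to curated tables. The field names (`event_id`, `tenant_id`, `emitted_at`, `payload`) and the 10-minute skew window are illustrative assumptions, not part of the brief:

```python
import json

# Illustrative sketch of row-level quality rules: malformed payloads,
# missing required fields, duplicate event IDs, and clock skew.
# Field names and the skew threshold are assumptions, not spec.

REQUIRED_FIELDS = {"event_id", "tenant_id", "emitted_at", "payload"}
MAX_CLOCK_SKEW_SEC = 600  # reject timestamps >10 min in the future

def classify(raw: str, seen_ids: set, now: float):
    """Return ("ok" | rejection reason, parsed event or None)."""
    try:
        event = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "malformed", None
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return f"missing:{','.join(sorted(missing))}", None
    if event["event_id"] in seen_ids:
        return "duplicate", None
    if event["emitted_at"] > now + MAX_CLOCK_SKEW_SEC:
        return "clock_skew", None
    seen_ids.add(event["event_id"])
    return "ok", event
```

In the actual pipeline, the dedup set would not live in memory: duplicate suppression and late-event handling belong in Structured Streaming's state store, bounded by a watermark so state does not grow without limit, with rejected rows routed to a quarantine Delta table rather than dropped.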
Constraints
- The primary platform should remain within the Databricks ecosystem where possible.
- Logs may contain sensitive identifiers; encryption, access control, and auditability are required.
- Cross-region network egress should be minimized.
- The design must support tenant isolation and selective deletion for compliance workloads.
- Assume a team of 6-8 engineers and a hard requirement to avoid long operational runbooks.
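The tenant-isolation and selective-deletion constraints interact with the physical layout: if curated tables are partitioned by date and tenant (a layout assumed here, not mandated by the brief), a compliance deletion for one tenant reduces to rewriting a bounded set of partitions, e.g. via Delta's DELETE followed by VACUUM, instead of scanning 30 days of all tenants' data. A minimal sketch of that partition planning:

```python
from datetime import date, timedelta

# Hypothetical partition layout: date=YYYY-MM-DD/tenant=<id>.
# A selective delete for one tenant only needs to touch the
# partitions in the requested date range.

def partitions_to_rewrite(tenant: str, start: date, end: date):
    """List the partition directories a selective delete must touch."""
    days = (end - start).days + 1
    return [
        f"date={start + timedelta(days=i)}/tenant={tenant}"
        for i in range(days)
    ]
```

Whether tenant belongs in the partition key at this cardinality is itself a design decision the candidate should defend (high-cardinality tenant partitioning can produce small files, which conflicts with the file-sizing requirement above).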