Context
Databricks runs internal platform telemetry pipelines that ingest workspace audit logs, cluster events, and job run metadata for operational analytics and customer-facing usage reporting. The current design relies on hourly batch ingestion into Delta Lake, but product and support teams now need near-real-time visibility, and the pipeline must preserve correctness during regional network partitions and downstream outages.
You are asked to design a Databricks-native pipeline and explain how CAP theorem trade-offs influence your choices. Assume some upstream systems are distributed across regions and may become partitioned; your pipeline must make explicit decisions about when to favor availability versus consistency.
Scale Requirements
- Sources: audit logs, job events, billing usage, cluster metrics
- Throughput: 250K events/sec at peak across regions, 40K events/sec sustained average
- Event size: 1-4 KB JSON
- Daily volume: ~12 TB raw, ~4 PB retained over 13 months
- Freshness targets: Bronze < 60 seconds, Silver < 5 minutes, Gold < 15 minutes
- Query SLA: 99.9% of dashboard queries under 3 seconds on Gold tables
Requirements
- Design an end-to-end pipeline on Databricks using Auto Loader, Delta Lake, and Databricks Workflows for ingestion, transformation, and orchestration; a Bronze ingestion sketch follows this list.
- Show how the system handles network partitions between regions, delayed source delivery, and duplicate event replay (the idempotent MERGE sketch below covers replay and late arrivals).
- Define where the design prioritizes consistency (for example, billing-grade aggregates) versus availability (for example, operational monitoring dashboards).
- Implement idempotent ingestion, schema evolution handling, and replay/backfill support for 30 days of historical data; see the MERGE sketch after this list.
- Include data quality controls for malformed JSON, null primary keys, out-of-order events, and source drift; a quarantine-gate sketch follows.
- Describe monitoring, alerting, and recovery procedures for streaming lag, failed checkpoints, and Delta write conflicts; a listener-based lag monitor is sketched below.
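As a starting point for the first requirement, here is a minimal Bronze ingestion sketch using Auto Loader (`cloudFiles`). The bucket paths, table name, and trigger interval are illustrative assumptions sized against the < 60 s Bronze target, not prescribed by the brief; `spark` is the ambient session in a Databricks notebook.

```python
from pyspark.sql import functions as F

RAW_PATH = "s3://telemetry-landing/audit-logs/"               # hypothetical landing bucket
CHECKPOINT = "s3://telemetry-checkpoints/bronze/audit-logs/"  # hypothetical checkpoint path

bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks the evolving schema here and adds new columns as sources drift.
    .option("cloudFiles.schemaLocation", CHECKPOINT)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.inferColumnTypes", "true")
    .load(RAW_PATH)
    # Lineage columns support replay, SOC 2 audit, and quarantine triage downstream.
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

(bronze.writeStream
    .option("checkpointLocation", CHECKPOINT)
    # A 30 s trigger leaves headroom under the < 60 s Bronze freshness target.
    .trigger(processingTime="30 seconds")
    .toTable("telemetry.bronze_audit_logs"))
```

Fields that fail to parse land in Auto Loader's `_rescued_data` column rather than failing the stream, which the quality gate sketched further below relies on.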
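For duplicate replay and late delivery, one plausible pattern is `foreachBatch` plus a Delta MERGE keyed on a unique event ID, which makes reprocessing idempotent: a replayed event matches an existing key and is ignored, while a late event is still inserted whenever it arrives. The `event_id` column and table names are assumptions of the sketch.

```python
from delta.tables import DeltaTable

def upsert_events(batch_df, batch_id):
    # MERGE requires a unique key on the source side, so dedupe within the batch first.
    deduped = batch_df.dropDuplicates(["event_id"])
    (DeltaTable.forName(spark, "telemetry.silver_events").alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        # A matched key is a replayed duplicate: do nothing, insert only new events.
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("telemetry.bronze_audit_logs")
    .writeStream
    .foreachBatch(upsert_events)
    .option("checkpointLocation", "s3://telemetry-checkpoints/silver/events/")
    .trigger(processingTime="1 minute")
    .start())
```

Because the MERGE keys on `event_id` rather than arrival time, a 30-day backfill can be replayed through the same path without creating duplicates.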
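For the quality controls, a sketch of a quarantine gate between Bronze and Silver: records with a null primary key or rescued (malformed) JSON are routed to a side table instead of failing the stream. Table names and the `event_id` key are assumptions; Delta CHECK constraints or DLT expectations could enforce the same rules declaratively.

```python
from pyspark.sql import functions as F

def quality_gate(batch_df, batch_id):
    # Auto Loader parks fields it could not parse in _rescued_data.
    is_bad = F.col("event_id").isNull() | F.col("_rescued_data").isNotNull()
    batch_df.where(is_bad).write.mode("append") \
        .saveAsTable("telemetry.quarantine_events")
    batch_df.where(~is_bad).write.mode("append") \
        .saveAsTable("telemetry.silver_events_staging")

(spark.readStream.table("telemetry.bronze_audit_logs")
    .writeStream
    .foreachBatch(quality_gate)
    .option("checkpointLocation", "s3://telemetry-checkpoints/dq/audit-logs/")
    .start())
```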
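For monitoring, one way to surface streaming lag and failed queries is a `StreamingQueryListener` (available in PySpark 3.4+). The `notify_oncall` hook is a placeholder stub, not a real API, and the threshold is taken from the Bronze freshness target.

```python
from pyspark.sql.streaming import StreamingQueryListener

BRONZE_SLA_MS = 60_000  # < 60 s Bronze freshness target

def notify_oncall(message: str) -> None:
    # Placeholder: wire to PagerDuty/Slack in a real deployment.
    print(f"ALERT: {message}")

class LagListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        p = event.progress
        # triggerExecution spans a whole micro-batch; values approaching the
        # SLA mean the stream is not keeping up with arrivals.
        if p.durationMs.get("triggerExecution", 0) > BRONZE_SLA_MS:
            notify_oncall(f"{p.name}: micro-batch exceeded {BRONZE_SLA_MS} ms")

    def onQueryTerminated(self, event):
        # Exceptional termination often indicates a corrupt checkpoint or a
        # Delta write conflict; page rather than silently restarting.
        if event.exception:
            notify_oncall(f"query {event.id} terminated: {event.exception}")

spark.streams.addListener(LagListener())
```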
Constraints
- Must run primarily on the Databricks Data Intelligence Platform on AWS.
- Prefer Delta Lake tables over external warehouse-first designs.
- Budget target: incremental platform cost under $35K/month.
- Compliance: SOC 2 auditability and deletion support within 7 days for selected customer metadata.
- Engineering team: 5 data engineers, 1 platform engineer; solution should minimize operational overhead.