Context
The platform is a Databricks lakehouse that ingests product telemetry, audit logs, and billing events into Delta tables serving internal operations and customer-facing reporting. Pipelines run on Databricks Jobs, Delta Live Tables, and Structured Streaming, but observability is fragmented across job logs, ad hoc dashboards, and inconsistent alerts, which makes it hard to detect regressions, prevent SLA misses, and plan cluster capacity.
You are asked to design a continuous-improvement program for pipeline observability, alerting, and capacity planning across both batch and streaming workloads.
Scale Requirements
- Sources: 120+ upstream producers across product telemetry, REST APIs, and CDC feeds
- Throughput: 180K events/sec peak streaming ingest, 35K events/sec average
- Batch volume: 75 TB/day across bronze, silver, and gold Delta tables
- Pipelines: 250 scheduled Databricks Jobs, 40 Delta Live Tables pipelines, 18 always-on streaming queries
- Latency targets: streaming freshness < 2 minutes P95; batch completion by 6:00 AM UTC
- Retention: 180 days for raw data, 2 years for curated metrics and audit history
Requirements
- Define a standardized observability architecture for Databricks Jobs, Delta Live Tables, and Structured Streaming (see the streaming progress listener sketch after this list).
- Propose pipeline-level SLIs/SLOs for freshness, success rate, data quality, and resource utilization (the freshness SLI and alerting sketch after this list shows one way to compute and page on them).
- Design actionable alerting to reduce noisy pages while catching SLA risk early.
- Include data quality monitoring for schema drift, null spikes, duplicate records, and delayed upstream delivery (see the Delta Live Tables expectations sketch after this list).
- Explain how you would use historical workload metrics to forecast capacity and right-size job clusters or serverless usage (see the capacity trend sketch after this list).
- Describe how you would roll out improvements incrementally without disrupting existing production pipelines.
- Show how failed runs, replay/backfill jobs, and late-arriving data are tracked distinctly in reporting (see the run ledger sketch below).
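To ground the first requirement, a minimal sketch of one standardization approach: a PySpark StreamingQueryListener that appends each micro-batch's progress to a central Delta metrics table, so streaming queries can be monitored from the same place as Jobs and DLT (which already emit run histories and event logs that can be landed alongside it). The table name ops.observability.streaming_progress and its column layout are assumptions, not an existing standard; `spark` is the ambient SparkSession in a Databricks notebook or job.

```python
from pyspark.sql.streaming import StreamingQueryListener

METRICS_TABLE = "ops.observability.streaming_progress"  # hypothetical central metrics table

class ProgressToDelta(StreamingQueryListener):
    """Appends one row per micro-batch so dashboards and alerts query Delta, not driver logs."""

    def onQueryStarted(self, event):
        pass  # could record query start for run-level lineage

    def onQueryProgress(self, event):
        p = event.progress
        row = [(str(p.id), p.name, p.timestamp, p.batchId, p.numInputRows, p.json)]
        schema = ("query_id string, query_name string, trigger_ts string, "
                  "batch_id long, num_input_rows long, raw_progress string")
        spark.createDataFrame(row, schema).write.mode("append").saveAsTable(METRICS_TABLE)

    def onQueryTerminated(self, event):
        pass  # could record the exception message for failed-run reporting

spark.streams.addListener(ProgressToDelta())  # register once per cluster or session
```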
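For the freshness SLI and its alert, a sketch of a lightweight check that could run as a scheduled job: it computes P95 freshness per table over a 15-minute window and pages only on a sustained breach of the 2-minute target, so a single slow micro-batch never wakes the small on-call rotation. The ops.observability.table_freshness table, its columns, and the webhook endpoint are assumptions.

```python
import requests

FRESHNESS_SLO_SECONDS = 120                          # streaming freshness target: < 2 minutes P95
ALERT_WEBHOOK = "https://hooks.example.com/oncall"   # hypothetical pager/webhook endpoint

# Evaluate the SLI over a window rather than a single observation so only a
# sustained breach results in a page.
breaches = spark.sql(f"""
    SELECT table_name,
           percentile_approx(freshness_seconds, 0.95) AS p95_freshness_seconds
    FROM ops.observability.table_freshness            -- hypothetical SLI source table
    WHERE observed_at >= current_timestamp() - INTERVAL 15 MINUTES
    GROUP BY table_name
    HAVING percentile_approx(freshness_seconds, 0.95) > {FRESHNESS_SLO_SECONDS}
""").collect()

if breaches:
    payload = {
        "summary": f"{len(breaches)} table(s) breaching freshness SLO (P95 > {FRESHNESS_SLO_SECONDS}s)",
        "tables": [r["table_name"] for r in breaches],
    }
    requests.post(ALERT_WEBHOOK, json=payload, timeout=10)
```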
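For the data quality requirement, a minimal sketch using Delta Live Tables expectations, which DLT records per run and surfaces in its event log; the source view, column names, and thresholds are illustrative assumptions. Schema drift is not shown here because it is usually caught at ingestion (for example via Auto Loader's rescued-data column) rather than with row-level expectations.

```python
import dlt

@dlt.table(comment="Silver telemetry with quality expectations enforced")
@dlt.expect("non_null_event_id", "event_id IS NOT NULL")                      # null-spike guard
@dlt.expect_or_drop("valid_event_time", "event_time IS NOT NULL")             # drop unusable rows
@dlt.expect("fresh_upstream", "ingest_time >= current_timestamp() - INTERVAL 2 HOURS")  # delayed-delivery signal
def silver_telemetry():
    bronze = dlt.read_stream("bronze_telemetry")   # hypothetical bronze source in the same pipeline
    # Duplicate guard: drop exact replays within a bounded watermark window.
    return (bronze
            .withWatermark("event_time", "2 hours")
            .dropDuplicates(["event_id", "event_time"]))
```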
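For capacity forecasting, a sketch of the simplest useful signal: a linear trend over recent daily DBU consumption from the system.billing.usage system table, projected a month ahead and compared against the 12% cost-growth budget. The 90-day window and linear fit are assumptions; in practice you would segment by SKU, workspace, or job tag before right-sizing anything.

```python
import numpy as np

# Daily DBU consumption for the last 90 days from the billing system table.
daily = spark.sql("""
    SELECT usage_date, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 90)
    GROUP BY usage_date
    ORDER BY usage_date
""").toPandas()

# Fit a simple linear trend and project 30 days past the end of the series.
x = np.arange(len(daily))
slope, intercept = np.polyfit(x, daily["dbus"].astype(float), 1)
projected = slope * (len(daily) - 1 + 30) + intercept

growth_pct = (projected - daily["dbus"].iloc[-1]) / daily["dbus"].iloc[-1] * 100
print(f"Projected daily DBUs in 30 days: {projected:,.0f} ({growth_pct:+.1f}% vs. today)")
```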
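For distinct tracking of failed runs, backfills/replays, and late-arriving data, a sketch of a run ledger: each job appends one row tagged with a run_type, and reporting groups by that tag so reprocessing never skews scheduled-run SLOs. The table name, widget names, and run_type vocabulary are assumptions; the run id could be injected through the Jobs {{job.run_id}} dynamic value reference.

```python
from datetime import datetime, timezone

RUN_LEDGER = "ops.observability.run_ledger"        # hypothetical reporting table

# Job parameters; populate job_run_id from the {{job.run_id}} dynamic value reference.
dbutils.widgets.text("run_type", "scheduled")      # "scheduled" | "backfill" | "replay"
dbutils.widgets.text("job_run_id", "unknown")
run_type = dbutils.widgets.get("run_type")
job_run_id = dbutils.widgets.get("job_run_id")

late_rows = 0   # replace with the pipeline's own count of records behind the watermark

row = [(job_run_id, run_type, datetime.now(timezone.utc).isoformat(), "succeeded", late_rows)]
schema = "job_run_id string, run_type string, finished_at string, status string, late_rows long"
spark.createDataFrame(row, schema).write.mode("append").saveAsTable(RUN_LEDGER)
# A "failed" row can be appended from the job's failure path so failed runs
# stay visible in the same ledger.
```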
Constraints
- Must prefer native Databricks capabilities where possible
- On-call team is small: 6 engineers across regions
- Budget target: keep monthly platform cost growth under 12%
- Auditability is required for incident reviews and compliance
- Some pipelines process PII and must restrict metric payloads and logs (see the metric scrubbing sketch below)
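On the PII constraint, a small sketch of allow-listing metric payloads so observability data carries only aggregate counts and column names, never sample values or row contents; the field names are illustrative assumptions.

```python
# Allow-list the fields that may leave a PII pipeline as observability metrics.
ALLOWED_METRIC_FIELDS = {
    "table_name", "job_run_id", "row_count",
    "null_counts", "duplicate_count", "failed_expectations",
}

def scrub_metrics(metrics: dict) -> dict:
    """Keep only aggregate fields; value-level details never pass through."""
    return {k: v for k, v in metrics.items() if k in ALLOWED_METRIC_FIELDS}

# Example: sample_rows is silently dropped before the payload is logged or emitted.
safe = scrub_metrics({"table_name": "silver.users", "row_count": 10_000,
                      "sample_rows": [{"email": "a@example.com"}]})
```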