Context
The platform is a Databricks lakehouse that ingests product telemetry, audit logs, and billing events into Delta tables serving internal operations and customer-facing reporting. Pipelines run on Databricks Jobs, Delta Live Tables, and Structured Streaming, but observability is fragmented across job logs, ad hoc dashboards, and inconsistent alerts, which makes it hard to detect regressions, prevent SLA misses, and plan cluster capacity.
You are asked to design a continuous-improvement program for pipeline observability, alerting, and capacity planning across both batch and streaming workloads.
Scale Requirements
- Sources: 120+ upstream producers across product telemetry, REST APIs, and CDC feeds
- Throughput: 180K events/sec peak streaming ingest, 35K events/sec average
- Batch volume: 75 TB/day across bronze, silver, and gold Delta tables
- Pipelines: 250 scheduled Databricks Jobs, 40 Delta Live Tables pipelines, 18 always-on streaming queries
- Latency targets: streaming freshness < 2 minutes P95; batch completion by 6:00 AM UTC
- Retention: 180 days for raw data, 2 years for curated metrics and audit history
Requirements
- Define a standardized observability architecture for Databricks Jobs, Delta Live Tables, and Structured Streaming (see the streaming progress listener sketch after this list).
- Propose pipeline-level SLIs/SLOs for freshness, success rate, data quality, and resource utilization (the freshness SLI and alerting sketch after this list shows one way to compute and page on them).
- Design actionable alerting to reduce noisy pages while catching SLA risk early.
- Include data quality monitoring for schema drift, null spikes, duplicate records, and delayed upstream delivery (see the Delta Live Tables expectations sketch after this list).
- Explain how you would use historical workload metrics to forecast capacity and right-size job clusters or serverless usage (see the capacity trend sketch after this list).
- Describe how you would roll out improvements incrementally without disrupting existing production pipelines.
- Show how failed runs, replay/backfill jobs, and late-arriving data are tracked distinctly in reporting (see the run ledger sketch below).
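To ground the first requirement, a minimal sketch of one standardization approach: a PySpark StreamingQueryListener that appends each micro-batch's progress to a central Delta metrics table, so streaming queries can be monitored from the same place as Jobs and DLT (which already emit run histories and event logs that can be landed alongside it). The table name ops.observability.streaming_progress and its column layout are assumptions, not an existing standard; `spark` is the ambient SparkSession in a Databricks notebook or job.

```python
from pyspark.sql.streaming import StreamingQueryListener

METRICS_TABLE = "ops.observability.streaming_progress"  # hypothetical central metrics table

class ProgressToDelta(StreamingQueryListener):
    """Appends one row per micro-batch so dashboards and alerts query Delta, not driver logs."""

    def onQueryStarted(self, event):
        pass  # could record query start for run-level lineage

    def onQueryProgress(self, event):
        p = event.progress
        row = [(str(p.id), p.name, p.timestamp, p.batchId, p.numInputRows, p.json)]
        schema = ("query_id string, query_name string, trigger_ts string, "
                  "batch_id long, num_input_rows long, raw_progress string")
        spark.createDataFrame(row, schema).write.mode("append").saveAsTable(METRICS_TABLE)

    def onQueryTerminated(self, event):
        pass  # could record the exception message for failed-run reporting

spark.streams.addListener(ProgressToDelta())  # register once per cluster or session
```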
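For the freshness SLI and its alert, a sketch of a lightweight check that could run as a scheduled job: it computes P95 freshness per table over a 15-minute window and pages only on a sustained breach of the 2-minute target, so a single slow micro-batch never wakes the small on-call rotation. The ops.observability.table_freshness table, its columns, and the webhook endpoint are assumptions.

```python
import requests

FRESHNESS_SLO_SECONDS = 120                          # streaming freshness target: < 2 minutes P95
ALERT_WEBHOOK = "https://hooks.example.com/oncall"   # hypothetical pager/webhook endpoint

# Evaluate the SLI over a window rather than a single observation so only a
# sustained breach results in a page.
breaches = spark.sql(f"""
    SELECT table_name,
           percentile_approx(freshness_seconds, 0.95) AS p95_freshness_seconds
    FROM ops.observability.table_freshness            -- hypothetical SLI source table
    WHERE observed_at >= current_timestamp() - INTERVAL 15 MINUTES
    GROUP BY table_name
    HAVING percentile_approx(freshness_seconds, 0.95) > {FRESHNESS_SLO_SECONDS}
""").collect()

if breaches:
    payload = {
        "summary": f"{len(breaches)} table(s) breaching freshness SLO (P95 > {FRESHNESS_SLO_SECONDS}s)",
        "tables": [r["table_name"] for r in breaches],
    }
    requests.post(ALERT_WEBHOOK, json=payload, timeout=10)
```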
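For the data quality requirement, a minimal sketch using Delta Live Tables expectations, which DLT records per run and surfaces in its event log; the source view, column names, and thresholds are illustrative assumptions. Schema drift is not shown here because it is usually caught at ingestion (for example via Auto Loader's rescued-data column) rather than with row-level expectations.

```python
import dlt

@dlt.table(comment="Silver telemetry with quality expectations enforced")
@dlt.expect("non_null_event_id", "event_id IS NOT NULL")                      # null-spike guard
@dlt.expect_or_drop("valid_event_time", "event_time IS NOT NULL")             # drop unusable rows
@dlt.expect("fresh_upstream", "ingest_time >= current_timestamp() - INTERVAL 2 HOURS")  # delayed-delivery signal
def silver_telemetry():
    bronze = dlt.read_stream("bronze_telemetry")   # hypothetical bronze source in the same pipeline
    # Duplicate guard: drop exact replays within a bounded watermark window.
    return (bronze
            .withWatermark("event_time", "2 hours")
            .dropDuplicates(["event_id", "event_time"]))
```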
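For capacity forecasting, a sketch of the simplest useful signal: a linear trend over recent daily DBU consumption from the system.billing.usage system table, projected a month ahead and compared against the 12% cost-growth budget. The 90-day window and linear fit are assumptions; in practice you would segment by SKU, workspace, or job tag before right-sizing anything.

```python
import numpy as np

# Daily DBU consumption for the last 90 days from the billing system table.
daily = spark.sql("""
    SELECT usage_date, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 90)
    GROUP BY usage_date
    ORDER BY usage_date
""").toPandas()

# Fit a simple linear trend and project 30 days past the end of the series.
x = np.arange(len(daily))
slope, intercept = np.polyfit(x, daily["dbus"].astype(float), 1)
projected = slope * (len(daily) - 1 + 30) + intercept

growth_pct = (projected - daily["dbus"].iloc[-1]) / daily["dbus"].iloc[-1] * 100
print(f"Projected daily DBUs in 30 days: {projected:,.0f} ({growth_pct:+.1f}% vs. today)")
```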
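For distinct tracking of failed runs, backfills/replays, and late-arriving data, a sketch of a run ledger: each job appends one row tagged with a run_type, and reporting groups by that tag so reprocessing never skews scheduled-run SLOs. The table name, widget names, and run_type vocabulary are assumptions; the run id could be injected through the Jobs {{job.run_id}} dynamic value reference.

```python
from datetime import datetime, timezone

RUN_LEDGER = "ops.observability.run_ledger"        # hypothetical reporting table

# Job parameters; populate job_run_id from the {{job.run_id}} dynamic value reference.
dbutils.widgets.text("run_type", "scheduled")      # "scheduled" | "backfill" | "replay"
dbutils.widgets.text("job_run_id", "unknown")
run_type = dbutils.widgets.get("run_type")
job_run_id = dbutils.widgets.get("job_run_id")

late_rows = 0   # replace with the pipeline's own count of records behind the watermark

row = [(job_run_id, run_type, datetime.now(timezone.utc).isoformat(), "succeeded", late_rows)]
schema = "job_run_id string, run_type string, finished_at string, status string, late_rows long"
spark.createDataFrame(row, schema).write.mode("append").saveAsTable(RUN_LEDGER)
# A "failed" row can be appended from the job's failure path so failed runs
# stay visible in the same ledger.
```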
Constraints
- Must prefer native Databricks capabilities where possible
- On-call team is small: 6 engineers across regions
- Budget target: keep monthly platform cost growth under 12%
- Auditability is required for incident reviews and compliance
- Some pipelines process PII and must restrict metric payloads and logs (see the metric scrubbing sketch below)
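On the PII constraint, a small sketch of allow-listing metric payloads so observability data carries only aggregate counts and column names, never sample values or row contents; the field names are illustrative assumptions.

```python
# Allow-list the fields that may leave a PII pipeline as observability metrics.
ALLOWED_METRIC_FIELDS = {
    "table_name", "job_run_id", "row_count",
    "null_counts", "duplicate_count", "failed_expectations",
}

def scrub_metrics(metrics: dict) -> dict:
    """Keep only aggregate fields; value-level details never pass through."""
    return {k: v for k, v in metrics.items() if k in ALLOWED_METRIC_FIELDS}

# Example: sample_rows is silently dropped before the payload is logged or emitted.
safe = scrub_metrics({"table_name": "silver.users", "row_count": 10_000,
                      "sample_rows": [{"email": "a@example.com"}]})
```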