Project Context
Databricks runs a customer-facing data platform workflow that provisions jobs, serves SQL Warehouse traffic, and powers internal on-call operations. Over the last quarter, the team has had 6 production incidents where alerts fired late, dashboards lacked enough context to isolate the fault, and mean time to resolution averaged 78 minutes. You are the DevOps Engineer responsible for leading an observability improvement project across one critical production surface in 8 weeks.
The working team includes 6 engineers: 2 platform engineers, 2 SREs, 1 software engineer from the service team, and you as the execution lead. The urgency is high because the VP of Engineering wants the new monitoring baseline in place before a large enterprise customer launch next quarter.
Key Stakeholders
The Engineering Director wants faster incident detection without increasing pager fatigue. The product team wants no customer-visible downtime during rollout. Security wants auditability for alert changes and access controls. Finance has capped incremental tooling spend, while the on-call team wants fewer noisy alerts and better runbooks.
Constraints
- Timeline: 8 weeks, with production readiness review in Week 7
- Budget: $45,000 incremental spend maximum
- Team capacity: 6 people, but 2 are only available 50% due to ongoing incident support
- Existing stack: Databricks Lakehouse Monitoring, Databricks SQL dashboards, PagerDuty, and cloud-native logs/metrics already in place
- Scope: You may improve one tier-1 production service and its top 3 dependent jobs only
Complications
- Alert noise is already high: 38% of pages in the last 30 days were non-actionable.
- The service team wants deep custom telemetry, but adding instrumentation may delay the enterprise launch by 2 weeks.
- Historical incident data is incomplete because two prior sev-2 incidents were documented inconsistently.
Deliverables
- Create an 8-week execution plan with milestones, owners, and dependency management.
- Define the minimum viable observability scope for launch and what you would defer.
- Propose alerting, dashboard, and runbook changes using Databricks-native surfaces where possible.
- Specify launch success metrics and a 30-day post-launch monitoring plan.
- Identify the top risks, trade-offs, and rollback criteria if the rollout increases alert volume or system overhead.