Project Context
Databricks is migrating a large enterprise customer from legacy Hive metastore access patterns to Unity Catalog across 1,200 production data assets and 180 scheduled jobs. The customer has a firm go-live date in 10 weeks because their internal audit deadline requires centralized governance, but their platform team also expects no material regression in job reliability or query performance.
You are the DevOps Engineer leading execution with a cross-functional team of 9: 3 platform engineers, 2 data engineers, 1 security engineer, 1 solutions architect, 1 QA engineer, and 1 customer success manager. Leadership wants the migration completed this quarter to secure a 3-year renewal worth $4.5M ARR.
Key Stakeholders
The customer CISO wants strict access controls and full auditability in Unity Catalog before any production cutover. The customer analytics lead wants existing Databricks SQL dashboards and ETL jobs to run within 5% of current latency. Your internal account team is pushing for the earliest possible launch to support renewal timing, while the engineering manager is concerned about reliability and on-call load during rollout.
Constraints
- Timeline: 10 weeks, with production cutover no later than September 15
- Budget: $140,000 for migration support, temporary QA, and performance testing
- Team capacity: no additional headcount; 2 engineers are each available only 50% of the time because of other customer escalations
- Dependencies: customer IAM team must deliver final SCIM group mappings by Week 3; 35 critical jobs depend on external data shares that cannot be modified until Week 5
- SLOs: maintain 99.9% scheduled job success during migration and keep P95 query latency regression under 5%
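As a rough illustration, the two SLOs above could be checked mechanically with something like the Python sketch below. Only the 99.9% and 5% thresholds come from the constraints. The function names, the nearest-rank percentile definition, and the inputs (job run counts and latency samples exported from wherever the team collects them) are assumptions for illustration and do not represent any Databricks API.

```python
# Thresholds taken from the constraints above.
JOB_SUCCESS_SLO = 0.999      # 99.9% scheduled job success
MAX_P95_REGRESSION = 0.05    # P95 latency regression must stay under 5%


def p95(samples):
    """Nearest-rank 95th percentile (an assumed definition; others exist)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def job_success_rate(succeeded, total):
    """Fraction of scheduled job runs that succeeded."""
    if total <= 0:
        raise ValueError("total runs must be positive")
    return succeeded / total


def p95_regression(baseline_samples, candidate_samples):
    """Relative change in P95 latency; 0.14 means 14% slower."""
    base = p95(baseline_samples)
    return (p95(candidate_samples) - base) / base


def slos_met(succeeded, total, baseline_samples, candidate_samples):
    """True only if both the job success and latency SLOs hold."""
    return (
        job_success_rate(succeeded, total) >= JOB_SUCCESS_SLO
        and p95_regression(baseline_samples, candidate_samples) < MAX_P95_REGRESSION
    )
```

A check like this is only as good as its inputs: the baseline samples would need to be captured on the legacy Hive metastore path before any governance policies are enabled, and over a comparable workload window.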
Complications
- In Week 2, early testing shows that P95 query latency is 14% slower on several high-usage Databricks SQL dashboards once the new governance policies are enabled.
- The account executive has told the customer that all workloads can move in a single cutover weekend, but your team believes a phased rollout is safer.
- A recent Sev-2 incident has made the customer highly sensitive to rollback readiness.
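To put the Week 2 latency finding in terms of the 5% regression budget, the arithmetic can be sketched as below. Only the 14% and 5% figures come from the scenario. The helper name is hypothetical, and the calculation assumes the 14% figure applies uniformly to a given dashboard, which the early test data may not support.

```python
def required_latency_reduction(observed_regression, allowed_regression):
    """Fraction by which the current (regressed) P95 must drop to meet budget.

    Both arguments are relative to the same baseline, e.g. 0.14 for 14% slower.
    """
    observed = 1.0 + observed_regression   # current P95 as a multiple of baseline
    allowed = 1.0 + allowed_regression     # highest acceptable multiple of baseline
    if observed <= allowed:
        return 0.0  # already within budget
    return (observed - allowed) / observed


# Scenario figures: 14% observed regression against a 5% budget.
overshoot_points = (0.14 - 0.05) * 100          # about 9 percentage points over budget
needed = required_latency_reduction(0.14, 0.05)
print(f"Over budget by {overshoot_points:.0f} points; "
      f"current P95 must fall by about {needed:.1%}")
```

In other words, the affected dashboards are roughly 9 percentage points over budget, and their post-policy P95 would need to come down by a little under 8% before they satisfy the launch criterion.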
Deliverables
- Build a 10-week execution plan with milestones, owners, and dependency management.
- Recommend how to balance reliability, performance, and delivery speed, including what you would phase versus defer.
- Define launch criteria, rollback triggers, and post-cutover monitoring in Databricks.
- Identify the top risks and mitigation actions, including stakeholder communication points.
- Explain how you would handle the pressure for a single-weekend launch if the evidence does not support it.