Project Background
A Databricks field engineering team is helping a large fintech customer standardize how its data science and ML teams build and deploy GenAI and predictive ML workloads on the Databricks Data Intelligence Platform. Today, teams rely on inconsistent Apache Spark cluster policies, ad hoc MLflow tracking setups, and separate dev/qa/prod workspaces, practices that have driven up compute spend, weakened governance, and slowed releases.
You are the AI Engineer leading execution for a 12-week program involving 16 people across platform engineering, data science, ML engineering, security, and analytics. The customer wants a production-ready operating model for Spark, PySpark, Spark Streaming, Mosaic AI experimentation, Databricks Model Serving, and RAG applications built on Databricks Vector Search and the Databricks Agent Framework, delivered before the next quarterly business review.
Key Stakeholders
- The Head of Data Science wants flexible GPU access for experimentation with DBRX and the Foundation Model APIs.
- The Platform Engineering Manager wants strict cluster policies, budget caps, and fewer workspace exceptions.
- The Security Architect requires all assets to be governed through Databricks Unity Catalog, with tighter approval workflows across dev/qa/prod.
- The VP of Product wants one customer-facing support copilot launched this quarter.
Constraints
- Timeline: 12 weeks
- Budget: $420K remaining this quarter, including compute and contractor support
- Team: 16 people; no net new full-time headcount
- Existing estate: 3 workspaces, 42 active notebooks, 11 MLflow experiments, 2 Spark Streaming pipelines, 1 RAG pilot
- Target launch: production support copilot serving 8,000 internal agents by week 12
Complications
- Security has paused promotion of new models to prod until Unity Catalog lineage and access controls are standardized (see the access-control sketch after this list).
- Data scientists are already running oversized all-purpose clusters, and monthly Databricks spend is 28% over forecast (see the cluster-policy sketch after this list).
- The support copilot pilot shows inconsistent groundedness and faithfulness scores in MLflow Agent Evaluation with LLM-as-a-judge grading (see the evaluation sketch after this list).
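
For the access-control complication, "standardized" in practice means expressing the prod promotion boundary as explicit Unity Catalog grants. A minimal PySpark sketch, assuming hypothetical catalog, schema, model, and group names (fintech_prod, ml_models, sg-*); the real securables and principals would come from the Security Architect:

```python
# Minimal sketch: express the prod model-promotion boundary as Unity Catalog
# grants. All object and group names below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

grants = [
    # Only the approver group may create or promote registered models in prod.
    "GRANT USE CATALOG ON CATALOG fintech_prod TO `sg-ml-prod-approvers`",
    "GRANT USE SCHEMA, CREATE MODEL ON SCHEMA fintech_prod.ml_models TO `sg-ml-prod-approvers`",
    # Serving principals get execute-only access to the released model.
    "GRANT EXECUTE ON MODEL fintech_prod.ml_models.support_copilot TO `sg-ml-serving`",
]
for statement in grants:
    spark.sql(statement)
```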
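
For the cost complication, the usual first lever is a cluster policy that caps auto-termination, node types, and autoscaling for interactive work. A sketch using the databricks-sdk Python client; the policy name, node types, and limits are assumptions for illustration, not sizing guidance:

```python
# Sketch of a cost-capping cluster policy for interactive (all-purpose) use.
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

policy = {
    # Force auto-termination so idle all-purpose clusters stop accruing cost.
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    # Keep interactive work on small, approved node types.
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],
        "defaultValue": "i3.xlarge",
    },
    # Cap autoscaling instead of banning it outright.
    "autoscale.max_workers": {"type": "range", "maxValue": 8, "defaultValue": 4},
}

w.cluster_policies.create(name="ds-interactive-capped", definition=json.dumps(policy))
```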
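
For the quality complication, the groundedness gap should first be made reproducible: pin an evaluation set and score it with the built-in judges. A minimal sketch, assuming the Mosaic AI Agent Evaluation integration (the databricks-agents package) is available in the workspace; the eval row is hypothetical:

```python
# Minimal sketch: score a pinned eval set with the built-in LLM judges
# (groundedness, relevance, etc.). The eval row below is hypothetical.
import mlflow
import pandas as pd

eval_df = pd.DataFrame([
    {
        "request": "How do I reset a customer's 2FA device?",
        "response": "Open the Security tab and click Reset 2FA for the device.",
        "retrieved_context": [
            {"content": "To reset 2FA, go to Security > Devices and choose Reset."}
        ],
    }
])

# model_type="databricks-agent" invokes the Agent Evaluation judges and logs
# per-row ratings and rationales to the active MLflow run.
results = mlflow.evaluate(data=eval_df, model_type="databricks-agent")
print(results.metrics)
```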
Your Task
- Define a 12-week execution plan for compute standardization, governance policy rollout, and dev/qa/prod workflow design.
- Propose how you would align stakeholder priorities and handle trade-offs between experimentation speed, cost, and control.
- Specify the environment, approval, and release process for Spark jobs, Mosaic AI assets, Model Serving endpoints, and vector search index updates (see the index-update sketch after this list).
- Identify the top risks, escalation points, and rollback criteria for the production launch.
- Define success metrics for cost, compliance, delivery speed, and launch quality.
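
To make the vector search piece of the release process concrete: in a gated model, an index "update" is a deliberate post-approval sync from the source Delta table rather than a continuous one. A sketch assuming the databricks-vectorsearch client, with placeholder endpoint and index names:

```python
# Sketch of a gated vector search index update: trigger a sync only after the
# release checks pass. Endpoint and index names are placeholders.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.get_index(
    endpoint_name="support-copilot-vs-endpoint",
    index_name="fintech_prod.rag.support_docs_index",
)

# For a Delta Sync index in triggered mode, sync() pulls the latest rows from
# the source table into the index as a single, auditable update.
index.sync()
```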