
A Databricks team runs a production RAG assistant for internal support, built on the Databricks Agent Framework, Databricks Vector Search, DBRX (served through the Databricks Foundation Model APIs), and Databricks Model Serving. Source documents are stored in Delta Lake and governed by Unity Catalog. Over the last three weeks, user satisfaction has dropped even though generation latency and model availability have remained stable.
Offline evaluation with MLflow Agent Evaluation (via mlflow.evaluate) suggests the retrieval layer may be degrading: the assistant increasingly answers with plausible but weakly supported content. The team wants a monitoring plan that detects retrieval failures before they materially affect users.
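As a concrete reference point, a run of that offline evaluation might look like the sketch below. It assumes execution in a Databricks notebook (where `spark` is predefined), a curated eval set in a hypothetical table `support.eval.curated_qa_set` that follows the Agent Evaluation input schema, and a hypothetical registered agent named `support_rag_agent`; none of these names come from the scenario itself.

```python
# Minimal sketch: offline evaluation with mlflow.evaluate in Agent Evaluation mode.
# Table and model names below are assumptions for illustration.
import mlflow

# Curated eval set with at least a "request" column, plus expected_response /
# expected_retrieved_context where available (Agent Evaluation schema).
eval_df = spark.table("support.eval.curated_qa_set").toPandas()

with mlflow.start_run(run_name="weekly_retrieval_eval"):
    results = mlflow.evaluate(
        model="models:/support_rag_agent/Production",  # hypothetical registered agent
        data=eval_df,
        model_type="databricks-agent",
    )

# Aggregate judge metrics (groundedness, relevance, ...) land in results.metrics;
# per-request scores are in the result tables, which is what lets the team
# separate retrieval-side failures from generation-side failures.
print(results.metrics)
per_request = results.tables["eval_results"]
```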
| Metric | 30 days ago | Current | Change |
|---|---|---|---|
| Retrieval hit@5 | 0.91 | 0.74 | -0.17 |
| Context precision@5 | 0.84 | 0.63 | -0.21 |
| Faithfulness | 0.88 | 0.79 | -0.09 |
| Groundedness | 0.86 | 0.68 | -0.18 |
| LLM-as-judge answer quality | 4.3/5 | 3.7/5 | -0.6 |
| No-answer rate | 6.1% | 14.8% | +8.7 pts |
| P95 end-to-end latency | 2.9s | 3.0s | +0.1s |
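For clarity on the two headline retrieval metrics in the table, they can be computed per request as in the following sketch. The function names and example IDs are illustrative, not an existing Databricks API.

```python
# Minimal sketch of hit@k and context precision@k over logged retrievals.
from typing import List, Set

def hit_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))

def context_precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in top_k) / len(top_k)

# Example: 3 of the top-5 chunks are relevant -> hit@5 = 1.0, precision@5 = 0.6
print(hit_at_k(["a", "b", "c", "d", "e"], {"b", "c", "e"}))
print(context_precision_at_k(["a", "b", "c", "d", "e"], {"b", "c", "e"}))
```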
You need to design an evaluation and monitoring approach that can distinguish retrieval failures from generation failures, quantify their impact on answer quality, and trigger investigation when the Vector Search index or the retrieval pipeline starts to drift.
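One way to frame the drift trigger is a scheduled check that compares a rolling window of logged retrieval metrics against the 30-day baseline and raises an alert when the drop exceeds a threshold. The sketch below assumes per-request metrics are already written to a hypothetical Delta table `support.monitoring.retrieval_metrics`; the baseline and threshold values are illustrative and would be tuned to the team's tolerance.

```python
# Minimal sketch of a scheduled retrieval-drift check over a 7-day window.
# Table name, column names, and thresholds are assumptions for illustration.
from pyspark.sql import functions as F

BASELINE_HIT_AT_5 = 0.91   # 30-day-ago baseline from the metrics table
MAX_ABSOLUTE_DROP = 0.05   # trigger investigation beyond this drop

recent = (
    spark.table("support.monitoring.retrieval_metrics")
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 7))
    .agg(
        F.avg("hit_at_5").alias("hit_at_5"),
        F.avg("context_precision_at_5").alias("context_precision_at_5"),
    )
    .first()
)

if recent["hit_at_5"] < BASELINE_HIT_AT_5 - MAX_ABSOLUTE_DROP:
    # In practice this would raise an alert (e.g. a Databricks SQL alert or a
    # webhook) and open an investigation into the Vector Search index sync,
    # embedding pipeline, or source-document ingestion.
    print(
        f"Retrieval drift detected: hit@5 {recent['hit_at_5']:.2f} "
        f"vs baseline {BASELINE_HIT_AT_5:.2f}"
    )
```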