
A Databricks team runs a production RAG assistant for internal support, built on the Databricks Agent Framework, Databricks Vector Search, DBRX (served through the Databricks Foundation Model APIs), and Databricks Model Serving. Source documents are stored in Delta Lake and governed by Unity Catalog. Over the last three weeks, user satisfaction has dropped even though generation latency and model availability have remained stable.
Offline evaluation with MLflow Agent Evaluation (via mlflow.evaluate) suggests the retrieval layer may be degrading: the assistant increasingly answers with plausible but weakly supported content. The team wants a monitoring plan that detects retrieval failures before they materially affect users.
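As a concrete reference point, a run of that offline evaluation might look like the sketch below. It assumes execution in a Databricks notebook (where `spark` is predefined), a curated eval set in a hypothetical table `support.eval.curated_qa_set` that follows the Agent Evaluation input schema, and a hypothetical registered agent named `support_rag_agent`; none of these names come from the scenario itself.

```python
# Minimal sketch: offline evaluation with mlflow.evaluate in Agent Evaluation mode.
# Table and model names below are assumptions for illustration.
import mlflow

# Curated eval set with at least a "request" column, plus expected_response /
# expected_retrieved_context where available (Agent Evaluation schema).
eval_df = spark.table("support.eval.curated_qa_set").toPandas()

with mlflow.start_run(run_name="weekly_retrieval_eval"):
    results = mlflow.evaluate(
        model="models:/support_rag_agent/Production",  # hypothetical registered agent
        data=eval_df,
        model_type="databricks-agent",
    )

# Aggregate judge metrics (groundedness, relevance, ...) land in results.metrics;
# per-request scores are in the result tables, which is what lets the team
# separate retrieval-side failures from generation-side failures.
print(results.metrics)
per_request = results.tables["eval_results"]
```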
| Metric | 30 days ago | Current | Change |
|---|---|---|---|
| Retrieval hit@5 | 0.91 | 0.74 | -0.17 |
| Context precision@5 | 0.84 | 0.63 | -0.21 |
| Faithfulness | 0.88 | 0.79 | -0.09 |
| Groundedness | 0.86 | 0.68 | -0.18 |
| LLM-as-judge answer quality | 4.3/5 | 3.7/5 | -0.6 |
| No-answer rate | 6.1% | 14.8% | +8.7 pts |
| P95 end-to-end latency | 2.9s | 3.0s | +0.1s |
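For clarity on the two headline retrieval metrics in the table, they can be computed per request as in the following sketch. The function names and example IDs are illustrative, not an existing Databricks API.

```python
# Minimal sketch of hit@k and context precision@k over logged retrievals.
from typing import List, Set

def hit_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))

def context_precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in top_k) / len(top_k)

# Example: 3 of the top-5 chunks are relevant -> hit@5 = 1.0, precision@5 = 0.6
print(hit_at_k(["a", "b", "c", "d", "e"], {"b", "c", "e"}))
print(context_precision_at_k(["a", "b", "c", "d", "e"], {"b", "c", "e"}))
```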
You need to design an evaluation and monitoring approach that can distinguish retrieval failures from generation failures, quantify their impact on answer quality, and trigger investigation when the Vector Search index or the retrieval pipeline starts to drift.
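One way to frame the drift trigger is a scheduled check that compares a rolling window of logged retrieval metrics against the 30-day baseline and raises an alert when the drop exceeds a threshold. The sketch below assumes per-request metrics are already written to a hypothetical Delta table `support.monitoring.retrieval_metrics`; the baseline and threshold values are illustrative and would be tuned to the team's tolerance.

```python
# Minimal sketch of a scheduled retrieval-drift check over a 7-day window.
# Table name, column names, and thresholds are assumptions for illustration.
from pyspark.sql import functions as F

BASELINE_HIT_AT_5 = 0.91   # 30-day-ago baseline from the metrics table
MAX_ABSOLUTE_DROP = 0.05   # trigger investigation beyond this drop

recent = (
    spark.table("support.monitoring.retrieval_metrics")
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 7))
    .agg(
        F.avg("hit_at_5").alias("hit_at_5"),
        F.avg("context_precision_at_5").alias("context_precision_at_5"),
    )
    .first()
)

if recent["hit_at_5"] < BASELINE_HIT_AT_5 - MAX_ABSOLUTE_DROP:
    # In practice this would raise an alert (e.g. a Databricks SQL alert or a
    # webhook) and open an investigation into the Vector Search index sync,
    # embedding pipeline, or source-document ingestion.
    print(
        f"Retrieval drift detected: hit@5 {recent['hit_at_5']:.2f} "
        f"vs baseline {BASELINE_HIT_AT_5:.2f}"
    )
```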