Context
FinFlow has launched an AI assistant that drafts answers for customer-support agents using the company's help center, policy docs, and recent incident notes. The feature is live, and the main problem is not initial launch quality but catching regressions quickly when prompts, retrieval settings, models, or source content change.
Constraints
- p95 end-to-end latency: ≤1,500 ms per request
- Cost ceiling: $12K/month at 1.2M requests/month (per-request budget sketched after this list)
- Hallucination rate: <2% on a high-risk policy QA golden set
- Prompt-injection success rate from retrieved content: <0.5%
- Monitoring must detect meaningful regressions within 30 minutes, without sending all traffic to expensive human review
- The system must support rollback decisions for prompt, retriever, reranker, and model changes
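The cost ceiling implies a fixed per-request budget. The arithmetic below is a minimal sketch, assuming traffic is spread evenly across the month and that any shadow-check or judge spend must fit inside the same $12K envelope; the serving-cost figure is an illustrative assumption, not part of the brief.

```python
# Minimal budget sketch. Assumptions: uniform traffic over the month,
# and monitoring (shadow checks, periodic judging) shares the same ceiling.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 1_200_000

per_request_budget = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS  # $0.01 per request

# Hypothetical split: if serving a request costs ~$0.008, roughly $0.002/request
# remains for cheap shadow checks on a sample plus periodic frontier-model judging.
assumed_serving_cost = 0.008  # assumption, not from the brief
monitoring_headroom = per_request_budget - assumed_serving_cost

print(f"per-request budget:   ${per_request_budget:.4f}")
print(f"monitoring headroom:  ${monitoring_headroom:.4f} per request")
```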
Available Resources
- 80K support articles and policy documents, updated daily
- Request/response logs with retrieved chunks, citations, latency, token usage, and user feedback (one possible record shape is sketched after this list)
- A 1,200-example labeled golden set covering answer correctness, groundedness, refusal quality, and safety
- A smaller adversarial set for prompt injection, unsupported queries, and stale-document cases
- Access to one frontier model for periodic judging and one cheaper model for online shadow checks
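The request/response logs above already carry most of what per-request monitoring needs. The sketch below shows one possible shape for such a record, built only from the fields named in the resource list plus version identifiers; all field names are illustrative assumptions, not FinFlow's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestLog:
    """One possible per-request log record; field names are illustrative."""
    request_id: str
    timestamp: str                       # ISO 8601
    prompt_version: str                  # which prompt config served this request
    retriever_version: str
    model_version: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)   # doc IDs cited in the draft
    latency_ms: int = 0                  # end-to-end, compared against the 1,500 ms p95 target
    input_tokens: int = 0
    output_tokens: int = 0
    user_feedback: Optional[str] = None  # agent thumbs-up/down or edit signal, if any
```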
Task
- Design a monitoring strategy for this LLM feature that detects quality regressions quickly across retrieval, generation, safety, cost, and latency.
- Define the offline and online evaluation plan, including what should run continuously, what should run on deploy, and what thresholds trigger alerts or rollback.
- Specify the telemetry schema you would instrument and the dashboards you would build, at per-request, per-experiment, and per-document-segment granularity.
- Propose an architecture for near-real-time monitoring, including how you would sample traffic, run automated judges, and control cost (a minimal sampling sketch follows this list).
- Identify the top failure modes, how each would be detected, and how the monitoring system distinguishes model regressions from data-quality or retrieval regressions.
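As one illustration of the sampling-plus-automated-judge piece referred to above (not a prescribed design), the sketch below routes a small, risk-weighted fraction of live traffic to the cheaper shadow-check model. All rates, topic tags, and field names are assumptions for illustration only.

```python
import random

# Assumed sampling rates; all values are illustrative, not part of the brief.
BASE_SAMPLE_RATE = 0.02           # ~2% of routine traffic gets a cheap shadow check
HIGH_RISK_SAMPLE_RATE = 0.25      # oversample policy/high-risk topics
NEGATIVE_FEEDBACK_SAMPLE_RATE = 1.0  # always judge requests with negative agent feedback


def should_shadow_check(record: dict) -> bool:
    """Decide whether a logged request is sent to the cheaper judge model."""
    if record.get("user_feedback") == "negative":
        return random.random() < NEGATIVE_FEEDBACK_SAMPLE_RATE
    if record.get("topic") in {"policy", "refunds", "compliance"}:  # assumed topic tags
        return random.random() < HIGH_RISK_SAMPLE_RATE
    return random.random() < BASE_SAMPLE_RATE
```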