Context
FinFlow has launched an AI assistant that drafts answers for customer-support agents using the company's help center, policy docs, and recent incident notes. The feature is live, and the main problem is not initial launch quality but catching regressions quickly when prompts, retrieval settings, models, or source content change.
Constraints
- p95 end-to-end latency: ≤1,500 ms per request
- Cost ceiling: $12K/month at 1.2M requests/month (per-request budget sketched after this list)
- Hallucination rate: <2% on a high-risk policy QA golden set
- Prompt-injection success rate from retrieved content: <0.5%
- Monitoring must detect meaningful regressions within 30 minutes, without sending all traffic to expensive human review
- The system must support rollback decisions for prompt, retriever, reranker, and model changes
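The cost ceiling implies a fixed per-request budget. The arithmetic below is a minimal sketch, assuming traffic is spread evenly across the month and that any shadow-check or judge spend must fit inside the same $12K envelope; the serving-cost figure is an illustrative assumption, not part of the brief.

```python
# Minimal budget sketch. Assumptions: uniform traffic over the month,
# and monitoring (shadow checks, periodic judging) shares the same ceiling.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 1_200_000

per_request_budget = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS  # $0.01 per request

# Hypothetical split: if serving a request costs ~$0.008, roughly $0.002/request
# remains for cheap shadow checks on a sample plus periodic frontier-model judging.
assumed_serving_cost = 0.008  # assumption, not from the brief
monitoring_headroom = per_request_budget - assumed_serving_cost

print(f"per-request budget:   ${per_request_budget:.4f}")
print(f"monitoring headroom:  ${monitoring_headroom:.4f} per request")
```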
Available Resources
- 80K support articles and policy documents, updated daily
- Request/response logs with retrieved chunks, citations, latency, token usage, and user feedback (one possible record shape is sketched after this list)
- A 1,200-example labeled golden set covering answer correctness, groundedness, refusal quality, and safety
- A smaller adversarial set for prompt injection, unsupported queries, and stale-document cases
- Access to one frontier model for periodic judging and one cheaper model for online shadow checks
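The request/response logs above already carry most of what per-request monitoring needs. The sketch below shows one possible shape for such a record, built only from the fields named in the resource list plus version identifiers; all field names are illustrative assumptions, not FinFlow's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestLog:
    """One possible per-request log record; field names are illustrative."""
    request_id: str
    timestamp: str                       # ISO 8601
    prompt_version: str                  # which prompt config served this request
    retriever_version: str
    model_version: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)   # doc IDs cited in the draft
    latency_ms: int = 0                  # end-to-end, compared against the 1,500 ms p95 target
    input_tokens: int = 0
    output_tokens: int = 0
    user_feedback: Optional[str] = None  # agent thumbs-up/down or edit signal, if any
```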
Task
- Design a monitoring strategy for this LLM feature that detects quality regressions quickly across retrieval, generation, safety, cost, and latency.
- Define the offline and online evaluation plan, including what should run continuously, what should run on deploy, and what thresholds trigger alerts or rollback.
- Specify the telemetry schema you would instrument and the dashboards you would build, at per-request, per-experiment, and per-document-segment granularity.
- Propose an architecture for near-real-time monitoring, including how you would sample traffic, run automated judges, and control cost (a minimal sampling sketch follows this list).
- Identify the top failure modes, how each would be detected, and how the monitoring system distinguishes model regressions from data-quality or retrieval regressions.
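As one illustration of the sampling-plus-automated-judge piece referred to above (not a prescribed design), the sketch below routes a small, risk-weighted fraction of live traffic to the cheaper shadow-check model. All rates, topic tags, and field names are assumptions for illustration only.

```python
import random

# Assumed sampling rates; all values are illustrative, not part of the brief.
BASE_SAMPLE_RATE = 0.02           # ~2% of routine traffic gets a cheap shadow check
HIGH_RISK_SAMPLE_RATE = 0.25      # oversample policy/high-risk topics
NEGATIVE_FEEDBACK_SAMPLE_RATE = 1.0  # always judge requests with negative agent feedback


def should_shadow_check(record: dict) -> bool:
    """Decide whether a logged request is sent to the cheaper judge model."""
    if record.get("user_feedback") == "negative":
        return random.random() < NEGATIVE_FEEDBACK_SAMPLE_RATE
    if record.get("topic") in {"policy", "refunds", "compliance"}:  # assumed topic tags
        return random.random() < HIGH_RISK_SAMPLE_RATE
    return random.random() < BASE_SAMPLE_RATE
```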