Context
FinMate has an AI-powered support assistant that answers customer questions about billing, refunds, and account settings using a RAG pipeline over help-center content and internal policy docs. Over the last 72 hours, CSAT on AI-resolved conversations dropped from 4.4 to 3.7, while human escalation rate increased from 12% to 21%.
Constraints
- p95 end-to-end latency must remain under 2,500ms
- Cost ceiling: $0.035 per request and $45K/month at current volume
- Hallucination rate must stay below 2% on a labeled golden set
- Prompt injection success rate must be near 0% on adversarial tests
- You cannot pause traffic entirely; only partial rollback, shadowing, or canary changes are allowed
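The constraints above can be encoded as a promotion gate that any candidate fix must clear before canarying. A minimal sketch, assuming a hypothetical `CandidateMetrics` record whose field names are illustrative (the thresholds are taken directly from the constraints list):

```python
# Hypothetical guardrail gate: checks a candidate's offline metrics against
# the constraints above before it is allowed into a canary. Field names and
# the CandidateMetrics type are assumptions, not an existing API.
from dataclasses import dataclass


@dataclass
class CandidateMetrics:
    p95_latency_ms: float
    cost_per_request_usd: float
    hallucination_rate: float      # fraction, measured on the labeled golden set
    injection_success_rate: float  # fraction, measured on adversarial tests


def passes_guardrails(m: CandidateMetrics) -> list[str]:
    """Return the list of violated constraints; an empty list means safe to canary."""
    violations = []
    if m.p95_latency_ms >= 2500:
        violations.append(f"p95 latency {m.p95_latency_ms}ms >= 2500ms")
    if m.cost_per_request_usd > 0.035:
        violations.append(f"cost ${m.cost_per_request_usd:.4f}/req > $0.035")
    if m.hallucination_rate >= 0.02:
        violations.append(f"hallucination rate {m.hallucination_rate:.1%} >= 2%")
    if m.injection_success_rate > 0.0:
        violations.append(f"injection success {m.injection_success_rate:.1%} > 0%")
    return violations
```

Returning the full violation list, rather than a boolean, makes canary-rejection reasons loggable.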
Available Resources
- Request/response logs for the last 30 days, including prompts, retrieved chunks, model/version, latency, token counts, and user feedback
- A 1,000-example golden set with labels for answer quality, groundedness, citation support, refusal correctness, and policy compliance
- A change log of recent deployments: a model version upgrade, a retriever index refresh, a prompt edit, and newly ingested policy documents
- Access to approved LLMs, embeddings, vector search, BM25, reranking, and offline replay infrastructure
Task
- Propose a step-by-step investigation plan to identify the root cause of the quality drop, including how you would isolate whether the issue comes from prompts, retrieval, model behavior, data freshness, or safety failures.
- Define the offline and online evaluation framework you would use before making architecture or prompt changes, including key slices, guardrails, and regression thresholds.
- Recommend short-term mitigations and a longer-term prevention plan, covering rollback/canary strategy, monitoring, and alerting.
- Describe how you would reason about cost and latency while debugging, especially if higher-quality mitigations increase retrieval depth or model size.
- Provide a production-quality Python sketch for replaying recent traffic, scoring outputs with structured evaluation, and comparing candidate fixes.
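As a starting point for the requested sketch, a minimal replay-and-compare skeleton is shown below. The `pipeline` and `scorer` callables are placeholders for the candidate's actual generation pipeline and structured evaluator; all names are hypothetical.

```python
# Minimal replay harness: run logged requests through a candidate pipeline,
# score each output, and aggregate. `pipeline` and `scorer` are placeholders
# to be replaced with the real RAG pipeline and structured judge.
import statistics
from typing import Callable


def replay(requests: list[dict],
           pipeline: Callable[[str], str],
           scorer: Callable[[str, dict], float]) -> dict:
    """Replay each logged request and return aggregate quality statistics."""
    scores = [scorer(pipeline(r["prompt"]), r) for r in requests]
    return {"n": len(scores),
            "mean": statistics.mean(scores),
            "min": min(scores)}


def compare(requests: list[dict],
            baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            scorer: Callable[[str, dict], float]) -> dict:
    """Side-by-side aggregate comparison of two pipeline variants on the same traffic."""
    return {"baseline": replay(requests, baseline, scorer),
            "candidate": replay(requests, candidate, scorer)}
```

A production version would add per-slice breakdowns, significance testing on the deltas, and latency/cost tracking alongside quality scores.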