Context
FinMate has an AI-powered support assistant that answers customer questions about billing, refunds, and account settings using a RAG pipeline over help-center content and internal policy docs. Over the last 72 hours, CSAT on AI-resolved conversations dropped from 4.4 to 3.7, while human escalation rate increased from 12% to 21%.
Constraints
- p95 end-to-end latency must remain under 2,500ms
- Cost ceiling: $0.035 per request and $45K/month at current volume
- Hallucination rate must stay below 2% on a labeled golden set
- Prompt injection success rate must be near 0% on adversarial tests
- You cannot pause traffic entirely; only partial rollback, shadowing, or canary changes are allowed
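The constraints above can be encoded as a promotion gate that any candidate fix must clear before canarying. A minimal sketch, assuming a hypothetical `CandidateMetrics` record whose field names are illustrative (the thresholds are taken directly from the constraints list):

```python
# Hypothetical guardrail gate: checks a candidate's offline metrics against
# the constraints above before it is allowed into a canary. Field names and
# the CandidateMetrics type are assumptions, not an existing API.
from dataclasses import dataclass


@dataclass
class CandidateMetrics:
    p95_latency_ms: float
    cost_per_request_usd: float
    hallucination_rate: float      # fraction, measured on the labeled golden set
    injection_success_rate: float  # fraction, measured on adversarial tests


def passes_guardrails(m: CandidateMetrics) -> list[str]:
    """Return the list of violated constraints; an empty list means safe to canary."""
    violations = []
    if m.p95_latency_ms >= 2500:
        violations.append(f"p95 latency {m.p95_latency_ms}ms >= 2500ms")
    if m.cost_per_request_usd > 0.035:
        violations.append(f"cost ${m.cost_per_request_usd:.4f}/req > $0.035")
    if m.hallucination_rate >= 0.02:
        violations.append(f"hallucination rate {m.hallucination_rate:.1%} >= 2%")
    if m.injection_success_rate > 0.0:
        violations.append(f"injection success {m.injection_success_rate:.1%} > 0%")
    return violations
```

Returning the full violation list, rather than a boolean, makes canary-rejection reasons loggable.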
Available Resources
- Request/response logs for the last 30 days, including prompts, retrieved chunks, model/version, latency, token counts, and user feedback
- A 1,000-example golden set with labels for answer quality, groundedness, citation support, refusal correctness, and policy compliance
- A change log of recent deployments: a model version upgrade, a retriever index refresh, a prompt edit, and newly ingested policy documents
- Access to approved LLMs, embeddings, vector search, BM25, reranking, and offline replay infrastructure
Task
- Propose a step-by-step investigation plan to identify the root cause of the quality drop, including how you would isolate whether the issue comes from prompts, retrieval, model behavior, data freshness, or safety failures.
- Define the offline and online evaluation framework you would use before making architecture or prompt changes, including key slices, guardrails, and regression thresholds.
- Recommend short-term mitigations and a longer-term prevention plan, covering rollback/canary strategy, monitoring, and alerting.
- Describe how you would reason about cost and latency while debugging, especially if higher-quality mitigations increase retrieval depth or model size.
- Provide a production-quality Python sketch for replaying recent traffic, scoring outputs with structured evaluation, and comparing candidate fixes.
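As a starting point for the requested sketch, a minimal replay-and-compare skeleton is shown below. The `pipeline` and `scorer` callables are placeholders for the candidate's actual generation pipeline and structured evaluator; all names are hypothetical.

```python
# Minimal replay harness: run logged requests through a candidate pipeline,
# score each output, and aggregate. `pipeline` and `scorer` are placeholders
# to be replaced with the real RAG pipeline and structured judge.
import statistics
from typing import Callable


def replay(requests: list[dict],
           pipeline: Callable[[str], str],
           scorer: Callable[[str, dict], float]) -> dict:
    """Replay each logged request and return aggregate quality statistics."""
    scores = [scorer(pipeline(r["prompt"]), r) for r in requests]
    return {"n": len(scores),
            "mean": statistics.mean(scores),
            "min": min(scores)}


def compare(requests: list[dict],
            baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            scorer: Callable[[str, dict], float]) -> dict:
    """Side-by-side aggregate comparison of two pipeline variants on the same traffic."""
    return {"baseline": replay(requests, baseline, scorer),
            "candidate": replay(requests, candidate, scorer)}
```

A production version would add per-slice breakdowns, significance testing on the deltas, and latency/cost tracking alongside quality scores.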