Reduce Hallucinations in Production Assistant

Context

FinGlide, a personal finance app, has an LLM-powered support assistant that answers user questions about account features, fees, transaction disputes, and card policies. After launch, the team found that some answers are plausible but incorrect, especially when policy details changed recently or when the model answers beyond the provided knowledge base.

Constraints

p95 latency must stay under 2,000ms
Cost ceiling: $12K/month at 300K requests/month
Hallucination rate must be reduced to <2% on a production-like golden set
The assistant must refuse or escalate when evidence is insufficient
Responses must not leak PII or follow malicious instructions embedded in retrieved content
Existing UX expects a direct answer plus 1-3 cited sources

Available Resources

25K internal help-center articles, policy docs, and change logs
3 months of production logs with user query, retrieved docs, model answer, user feedback, and escalation outcome
A small labeled set of 800 historical responses marked as correct / incorrect / unsupported
Access to OpenAI models, embeddings, and a managed vector store
Ability to change prompts, retrieval, routing, guardrails, and fallback behavior, but not the frontend experience

Task

Propose an eval-first plan to diagnose why hallucinations are happening in production, including how you would separate retrieval failures, prompt failures, and model failures.
Design a production mitigation strategy that improves factuality without breaking the latency and cost budgets. Be explicit about prompt changes, retrieval changes, refusal behavior, and when to escalate.
Define offline and online evaluation, including golden-set design, adversarial testing for prompt injection, and rollout metrics.
Provide a sample system prompt and a Python implementation for the answer pipeline, including structured output and basic citation validation.
Identify key failure modes, safety risks, and the main cost/latency tradeoffs in your design.

Context

Constraints

p95 latency must stay under 2,000ms
Cost ceiling: $12K/month at 300K requests/month
Hallucination rate must be reduced to <2% on a production-like golden set
The assistant must refuse or escalate when evidence is insufficient
Responses must not leak PII or follow malicious instructions embedded in retrieved content
Existing UX expects a direct answer plus 1-3 cited sources

Available Resources

25K internal help-center articles, policy docs, and change logs
3 months of production logs with user query, retrieved docs, model answer, user feedback, and escalation outcome
A small labeled set of 800 historical responses marked as correct / incorrect / unsupported
Access to OpenAI models, embeddings, and a managed vector store
Ability to change prompts, retrieval, routing, guardrails, and fallback behavior, but not the frontend experience

Task

Propose an eval-first plan to diagnose why hallucinations are happening in production, including how you would separate retrieval failures, prompt failures, and model failures.
Design a production mitigation strategy that improves factuality without breaking the latency and cost budgets. Be explicit about prompt changes, retrieval changes, refusal behavior, and when to escalate.
Define offline and online evaluation, including golden-set design, adversarial testing for prompt injection, and rollout metrics.
Provide a sample system prompt and a Python implementation for the answer pipeline, including structured output and basic citation validation.
Identify key failure modes, safety risks, and the main cost/latency tradeoffs in your design.

Context

Constraints

p95 latency must stay under 2,000ms
Cost ceiling: $12K/month at 300K requests/month
Hallucination rate must be reduced to <2% on a production-like golden set
The assistant must refuse or escalate when evidence is insufficient
Responses must not leak PII or follow malicious instructions embedded in retrieved content
Existing UX expects a direct answer plus 1-3 cited sources

Available Resources

25K internal help-center articles, policy docs, and change logs
3 months of production logs with user query, retrieved docs, model answer, user feedback, and escalation outcome
A small labeled set of 800 historical responses marked as correct / incorrect / unsupported
Access to OpenAI models, embeddings, and a managed vector store
Ability to change prompts, retrieval, routing, guardrails, and fallback behavior, but not the frontend experience

Task

Propose an eval-first plan to diagnose why hallucinations are happening in production, including how you would separate retrieval failures, prompt failures, and model failures.
Design a production mitigation strategy that improves factuality without breaking the latency and cost budgets. Be explicit about prompt changes, retrieval changes, refusal behavior, and when to escalate.
Define offline and online evaluation, including golden-set design, adversarial testing for prompt injection, and rollout metrics.
Provide a sample system prompt and a Python implementation for the answer pipeline, including structured output and basic citation validation.
Identify key failure modes, safety risks, and the main cost/latency tradeoffs in your design.

Context

Constraints

p95 latency must stay under 2,000ms
Cost ceiling: $12K/month at 300K requests/month
Hallucination rate must be reduced to <2% on a production-like golden set
The assistant must refuse or escalate when evidence is insufficient
Responses must not leak PII or follow malicious instructions embedded in retrieved content
Existing UX expects a direct answer plus 1-3 cited sources

Available Resources

25K internal help-center articles, policy docs, and change logs
3 months of production logs with user query, retrieved docs, model answer, user feedback, and escalation outcome
A small labeled set of 800 historical responses marked as correct / incorrect / unsupported
Access to OpenAI models, embeddings, and a managed vector store
Ability to change prompts, retrieval, routing, guardrails, and fallback behavior, but not the frontend experience

Task

Propose an eval-first plan to diagnose why hallucinations are happening in production, including how you would separate retrieval failures, prompt failures, and model failures.
Design a production mitigation strategy that improves factuality without breaking the latency and cost budgets. Be explicit about prompt changes, retrieval changes, refusal behavior, and when to escalate.
Define offline and online evaluation, including golden-set design, adversarial testing for prompt injection, and rollout metrics.
Provide a sample system prompt and a Python implementation for the answer pipeline, including structured output and basic citation validation.
Identify key failure modes, safety risks, and the main cost/latency tradeoffs in your design.

Interview Guides

Context

Constraints

Available Resources

Task

Reduce Hallucinations in Production Assistant

Context

Constraints

Available Resources

Task

Your Answer

Reduce Hallucinations in Production Assistant

Context

Constraints

Available Resources

Task

Reduce Hallucinations in Production Assistant

Context

Constraints

Available Resources

Task

Your Answer