Reduce Hallucinations in Answers

Scenario

You are improving an LLM-powered assistant that answers questions about financing policies, underwriting guidelines, and customer account workflows for internal users. The current system produces fluent answers, but reviewers have found that some responses include unsupported claims or confidently fill in missing details. Usage is growing to a few thousand questions per day, and incorrect answers can create operational and compliance risk. You need to reduce hallucinations without making the product too slow or too expensive.

Constraints

p95 latency must stay under 2,500 ms
Average cost must stay under $0.03 per request
Unsupported factual claims must be below 2% on a labeled evaluation set
The system must refuse or escalate when evidence is missing
Retrieved content may contain prompt-injection attempts or sensitive data
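
The three numeric limits above can be read as a release gate over a labeled evaluation run. The following is a minimal sketch, not a definitive implementation. It assumes each evaluated request is logged with the hypothetical fields `latency_ms`, `cost_usd`, and `has_unsupported_claim` (none of these names come from the scenario), and it assumes the 2% limit is measured per answer rather than per individual claim.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical per-request log fields; the names are illustrative only.
    latency_ms: float
    cost_usd: float
    has_unsupported_claim: bool


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]


def check_constraints(records):
    """Return pass/fail flags for the latency, cost, and accuracy limits."""
    n = len(records)
    avg_cost = sum(r.cost_usd for r in records) / n
    unsupported_rate = sum(r.has_unsupported_claim for r in records) / n
    return {
        "latency_ok": p95([r.latency_ms for r in records]) < 2500,
        "cost_ok": avg_cost < 0.03,
        "unsupported_ok": unsupported_rate < 0.02,
    }
```

A run would pass the gate only if all three flags are true; the refusal and prompt-injection constraints are behavioral and would need separate labeled checks.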

Available Resources

A corpus of internal policy documents, SOPs, and knowledge-base articles
Access to a GPT-4-class model and a smaller lower-cost model
A vector store with metadata filtering and keyword search
Capacity to label ~300 evaluation questions and failure cases per quarter
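
The labeling budget limits how precisely the 2% target can be verified: if all ~300 labels were spent on evaluation questions, 2% corresponds to about 6 failing answers. The sketch below is a rough illustration of that sampling uncertainty using a Wilson score interval. It assumes one binary label per answer and z = 1.96 for an approximately 95% interval; the function name and parameters are illustrative, not part of the scenario.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = failures / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    # Clamp to the valid range for a proportion.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(failures=3, n=300)
print(f"observed rate 1.0%, interval {low:.2%} to {high:.2%}")
```

Under these assumptions, even an observed rate of 1% (3 failures in 300) has an upper bound above 2%, so a single quarter of labels may not be enough to confirm the target with confidence.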

Question

How would you redesign this application to reduce hallucinations in production while staying within the latency and cost limits? Explain the approach you would take to grounding, prompting, evaluation, and runtime safeguards, including how you would handle missing evidence and prompt-injection risk.
