Business Context
AcmeCloud uses an internal LLM assistant to answer customer support agents' questions about product features, billing rules, and API behavior. The team wants a prompt engineering and response-validation workflow that reduces hallucinated answers while preserving fast response times.
Data
- Volume: 180,000 historical support Q&A pairs and 25,000 knowledge base articles
- Text length: user prompts range from 8 to 220 words; source documents from 50 to 2,000 words
- Language: English only
- Labels: each answer is tagged as grounded, partially_grounded, or hallucinated; the distribution is 68%, 21%, and 11%, respectively
- Noise: duplicated tickets, outdated docs, and inconsistent product naming across teams
Success Criteria
A good solution should reduce hallucinated responses by at least 40% relative to the current baseline prompt, achieve ≥0.85 macro-F1 on hallucination-risk classification, and keep end-to-end latency under 1.5 seconds per request.
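For concreteness, a minimal sketch of an offline check of these criteria, assuming scikit-learn is available and per-answer labels from a held-out set; the function and argument names are illustrative:

```python
from sklearn.metrics import f1_score

def meets_success_criteria(y_true, y_pred, baseline_halluc_rate,
                           new_halluc_rate, p95_latency_s):
    """Offline check of the three criteria above (names are illustrative).

    y_true / y_pred hold per-answer labels in {"grounded",
    "partially_grounded", "hallucinated"} from a held-out set.
    """
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # "Reduce hallucinated responses by at least 40%" is read here as a
    # *relative* reduction against the baseline prompt's hallucination rate.
    reduction = (baseline_halluc_rate - new_halluc_rate) / baseline_halluc_rate
    return macro_f1 >= 0.85 and reduction >= 0.40 and p95_latency_s < 1.5
```

In practice, reporting the three checks separately is more informative than a single boolean, but the thresholds are exactly those stated above.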
Constraints
- Must run in a private VPC; no external API calls during inference
- Inference budget is limited to one 16GB GPU and CPU-based retrieval
- Responses must cite supporting passages when confidence is low
- Prompt templates must be easy for non-ML support teams to edit
Requirements
- Define prompt engineering in practical terms for this system.
- Build a pipeline that retrieves relevant context, constructs a grounded prompt, and classifies whether the generated answer is likely hallucinated (see the retrieval sketch after this list).
- Implement preprocessing for support tickets and knowledge base articles (see the preprocessing sketch below).
- Fine-tune a lightweight transformer classifier that detects hallucination risk from the prompt, retrieved context, and model answer (see the classifier sketch below).
- Propose prompt changes and guardrails that reduce unsupported claims (see the guardrail sketch below).
- Describe how you would evaluate answer quality, grounding, and failure modes in production (see the monitoring sketch below).
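Illustrative Sketches
The sketches below are minimal starting points for the requirements above, not a reference implementation; library choices, model names, thresholds, and identifiers are assumptions unless stated in the brief. Python is used throughout.

Retrieval and grounded prompt construction. This sketch assumes BM25 over the knowledge base via the rank_bm25 package, which keeps retrieval on CPU as the constraints require; the template wording is illustrative:

```python
from rank_bm25 import BM25Okapi  # CPU-only keyword retrieval

# Plain-text template so non-ML support teams can edit it directly.
PROMPT_TEMPLATE = """You are AcmeCloud's support assistant.
Answer ONLY from the numbered context below. If the context does not
contain the answer, say "I don't have enough information" instead of guessing.

Context:
{context}

Question: {question}
Answer (cite passages like [1]):"""

class GroundedPromptBuilder:
    def __init__(self, kb_articles):
        self.kb_articles = kb_articles
        # Whitespace tokenization is a simplification; see the
        # preprocessing sketch for normalization that should run first.
        self.bm25 = BM25Okapi([a.lower().split() for a in kb_articles])

    def build(self, question, k=3):
        passages = self.bm25.get_top_n(
            question.lower().split(), self.kb_articles, n=k
        )
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return PROMPT_TEMPLATE.format(context=context, question=question), passages
```

A keyword retriever is a deliberate starting point: it is fast on CPU, and the plain-text template it feeds satisfies the editability constraint.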
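Preprocessing. A sketch of the normalization and deduplication implied by the noise described in the Data section; the product alias map is hypothetical and would come from the teams that own the inconsistent naming:

```python
import hashlib
import re

# Illustrative canonical-name map; the real mapping must come from the
# product teams responsible for the inconsistent naming noted above.
PRODUCT_ALIASES = {r"\bacme\s*db\b": "AcmeDB", r"\bacme\s*database\b": "AcmeDB"}

def normalize(text):
    text = re.sub(r"\s+", " ", text).strip()
    for pattern, canonical in PRODUCT_ALIASES.items():
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

def deduplicate(records):
    """Drop exact duplicates after normalization. Near-duplicate detection
    (e.g. MinHash) would be a natural next step for the duplicated tickets."""
    seen, unique = set(), []
    for record in records:
        norm = normalize(record)
        key = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(norm)
    return unique
```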
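Hallucination-risk classifier. A sketch of the lightweight cross-encoder: one small transformer (distilroberta-base is one choice that fits the 16GB GPU budget) scoring the concatenated prompt, retrieved context, and answer against the three labels. Shown at inference time; fine-tuning itself can use the standard transformers Trainer on the labeled Q&A pairs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["grounded", "partially_grounded", "hallucinated"]

# After fine-tuning, load the saved checkpoint instead of the base model.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=len(LABELS)
)
model.eval()

def hallucination_risk(question, context, answer):
    # Pack all three fields into one sequence; truncation keeps the
    # input inside the model's 512-token window.
    text = f"question: {question}\ncontext: {context}\nanswer: {answer}"
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(LABELS, probs.tolist()))
```

Given the 68/21/11 label skew, a class-weighted loss or oversampling of the hallucinated class is worth trying during fine-tuning.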
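Guardrails. A sketch of risk-based routing that also satisfies the citation constraint: low-risk answers pass through, medium-risk answers must cite their passages, and high-risk answers are blocked. Both thresholds are illustrative starting points to be tuned on the labeled validation set:

```python
CITE_THRESHOLD = 0.30   # above this, attach supporting passages
BLOCK_THRESHOLD = 0.60  # above this, refuse rather than risk a wrong answer

def apply_guardrails(answer, passages, risk_scores):
    # Weight partial grounding at half the cost of outright hallucination;
    # this weighting is an assumption to validate against the labeled data.
    p_bad = risk_scores["hallucinated"] + 0.5 * risk_scores["partially_grounded"]
    if p_bad >= BLOCK_THRESHOLD:
        return ("I can't confirm this from our documentation; "
                "escalating to a human agent.")
    if p_bad >= CITE_THRESHOLD:
        citations = "\n".join(f"[{i + 1}] {p[:120]}..."
                              for i, p in enumerate(passages))
        return f"{answer}\n\nSources:\n{citations}"
    return answer
```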
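Production monitoring. A sketch of lightweight in-production evaluation: count predicted grounding labels over time and queue risky answers, plus a random sample of the rest, for human review. Storage and the sampling rate are illustrative:

```python
import random
from collections import Counter

daily_counts = Counter()
review_queue = []

def log_response(risk_scores, request):
    label = max(risk_scores, key=risk_scores.get)
    daily_counts[label] += 1
    # Review every high-risk answer plus a 5% random sample of the rest,
    # so drift in the classifier itself is also caught.
    if label == "hallucinated" or random.random() < 0.05:
        review_queue.append(request)
```

Trending daily_counts against the 68/21/11 training distribution gives an early signal of drift, and the human-reviewed sample feeds failure-mode analysis and future fine-tuning rounds.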