Business Context
AcmeCloud uses an internal LLM assistant to answer customer support agents' questions about product features, billing rules, and API behavior. The team wants a prompt engineering and response-validation workflow that reduces hallucinated answers while preserving fast response times.
Data
- Volume: 180,000 historical support Q&A pairs and 25,000 knowledge base articles
- Text length: user prompts range from 8 to 220 words; source documents from 50 to 2,000 words
- Language: English only
- Labels: each answer is tagged as grounded, partially_grounded, or hallucinated; the distribution is 68%, 21%, and 11%, respectively
- Noise: duplicated tickets, outdated docs, and inconsistent product naming across teams
Success Criteria
A good solution should reduce hallucinated responses by at least 40% relative to the current baseline prompt, achieve ≥0.85 macro-F1 on hallucination-risk classification, and keep end-to-end latency under 1.5 seconds per request.
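For concreteness, a minimal sketch of an offline check of these criteria, assuming scikit-learn is available and per-answer labels from a held-out set; the function and argument names are illustrative:

```python
from sklearn.metrics import f1_score

def meets_success_criteria(y_true, y_pred, baseline_halluc_rate,
                           new_halluc_rate, p95_latency_s):
    """Offline check of the three criteria above (names are illustrative).

    y_true / y_pred hold per-answer labels in {"grounded",
    "partially_grounded", "hallucinated"} from a held-out set.
    """
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # "Reduce hallucinated responses by at least 40%" is read here as a
    # *relative* reduction against the baseline prompt's hallucination rate.
    reduction = (baseline_halluc_rate - new_halluc_rate) / baseline_halluc_rate
    return macro_f1 >= 0.85 and reduction >= 0.40 and p95_latency_s < 1.5
```

In practice, reporting the three checks separately is more informative than a single boolean, but the thresholds are exactly those stated above.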
Constraints
- Must run in a private VPC; no external API calls during inference
- Inference budget is limited to one 16GB GPU and CPU-based retrieval
- Responses must cite supporting passages when confidence is low
- Prompt templates must be easy for non-ML support teams to edit
Requirements
- Define prompt engineering in practical terms for this system.
- Build a pipeline that retrieves relevant context, constructs a grounded prompt, and classifies whether the generated answer is likely hallucinated (see the retrieval sketch after this list).
- Implement preprocessing for support tickets and knowledge base articles (see the preprocessing sketch below).
- Fine-tune a lightweight transformer classifier that detects hallucination risk from the prompt, retrieved context, and model answer (see the classifier sketch below).
- Propose prompt changes and guardrails that reduce unsupported claims (see the guardrail sketch below).
- Describe how you would evaluate answer quality, grounding, and failure modes in production (see the monitoring sketch below).
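Illustrative Sketches
The sketches below are minimal starting points for the requirements above, not a reference implementation; library choices, model names, thresholds, and identifiers are assumptions unless stated in the brief. Python is used throughout.

Retrieval and grounded prompt construction. This sketch assumes BM25 over the knowledge base via the rank_bm25 package, which keeps retrieval on CPU as the constraints require; the template wording is illustrative:

```python
from rank_bm25 import BM25Okapi  # CPU-only keyword retrieval

# Plain-text template so non-ML support teams can edit it directly.
PROMPT_TEMPLATE = """You are AcmeCloud's support assistant.
Answer ONLY from the numbered context below. If the context does not
contain the answer, say "I don't have enough information" instead of guessing.

Context:
{context}

Question: {question}
Answer (cite passages like [1]):"""

class GroundedPromptBuilder:
    def __init__(self, kb_articles):
        self.kb_articles = kb_articles
        # Whitespace tokenization is a simplification; see the
        # preprocessing sketch for normalization that should run first.
        self.bm25 = BM25Okapi([a.lower().split() for a in kb_articles])

    def build(self, question, k=3):
        passages = self.bm25.get_top_n(
            question.lower().split(), self.kb_articles, n=k
        )
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return PROMPT_TEMPLATE.format(context=context, question=question), passages
```

A keyword retriever is a deliberate starting point: it is fast on CPU, and the plain-text template it feeds satisfies the editability constraint.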
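Preprocessing. A sketch of the normalization and deduplication implied by the noise described in the Data section; the product alias map is hypothetical and would come from the teams that own the inconsistent naming:

```python
import hashlib
import re

# Illustrative canonical-name map; the real mapping must come from the
# product teams responsible for the inconsistent naming noted above.
PRODUCT_ALIASES = {r"\bacme\s*db\b": "AcmeDB", r"\bacme\s*database\b": "AcmeDB"}

def normalize(text):
    text = re.sub(r"\s+", " ", text).strip()
    for pattern, canonical in PRODUCT_ALIASES.items():
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

def deduplicate(records):
    """Drop exact duplicates after normalization. Near-duplicate detection
    (e.g. MinHash) would be a natural next step for the duplicated tickets."""
    seen, unique = set(), []
    for record in records:
        norm = normalize(record)
        key = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(norm)
    return unique
```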
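Hallucination-risk classifier. A sketch of the lightweight cross-encoder: one small transformer (distilroberta-base is one choice that fits the 16GB GPU budget) scoring the concatenated prompt, retrieved context, and answer against the three labels. Shown at inference time; fine-tuning itself can use the standard transformers Trainer on the labeled Q&A pairs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["grounded", "partially_grounded", "hallucinated"]

# After fine-tuning, load the saved checkpoint instead of the base model.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=len(LABELS)
)
model.eval()

def hallucination_risk(question, context, answer):
    # Pack all three fields into one sequence; truncation keeps the
    # input inside the model's 512-token window.
    text = f"question: {question}\ncontext: {context}\nanswer: {answer}"
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(LABELS, probs.tolist()))
```

Given the 68/21/11 label skew, a class-weighted loss or oversampling of the hallucinated class is worth trying during fine-tuning.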
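Guardrails. A sketch of risk-based routing that also satisfies the citation constraint: low-risk answers pass through, medium-risk answers must cite their passages, and high-risk answers are blocked. Both thresholds are illustrative starting points to be tuned on the labeled validation set:

```python
CITE_THRESHOLD = 0.30   # above this, attach supporting passages
BLOCK_THRESHOLD = 0.60  # above this, refuse rather than risk a wrong answer

def apply_guardrails(answer, passages, risk_scores):
    # Weight partial grounding at half the cost of outright hallucination;
    # this weighting is an assumption to validate against the labeled data.
    p_bad = risk_scores["hallucinated"] + 0.5 * risk_scores["partially_grounded"]
    if p_bad >= BLOCK_THRESHOLD:
        return ("I can't confirm this from our documentation; "
                "escalating to a human agent.")
    if p_bad >= CITE_THRESHOLD:
        citations = "\n".join(f"[{i + 1}] {p[:120]}..."
                              for i, p in enumerate(passages))
        return f"{answer}\n\nSources:\n{citations}"
    return answer
```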
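Production monitoring. A sketch of lightweight in-production evaluation: count predicted grounding labels over time and queue risky answers, plus a random sample of the rest, for human review. Storage and the sampling rate are illustrative:

```python
import random
from collections import Counter

daily_counts = Counter()
review_queue = []

def log_response(risk_scores, request):
    label = max(risk_scores, key=risk_scores.get)
    daily_counts[label] += 1
    # Review every high-risk answer plus a 5% random sample of the rest,
    # so drift in the classifier itself is also caught.
    if label == "hallucinated" or random.random() < 0.05:
        review_queue.append(request)
```

Trending daily_counts against the 68/21/11 training distribution gives an early signal of drift, and the human-reviewed sample feeds failure-mode analysis and future fine-tuning rounds.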