Business Context
AcmeCloud is deploying a customer support copilot that answers questions about product features, billing, and API behavior from internal documentation. The main risk is hallucination: the model may generate confident but unsupported answers, which can mislead users and increase support escalations.
Data
- Corpus: ~120,000 support articles, release notes, API docs, and policy pages
- Query volume: ~18,000 user questions per day
- Text length: user queries are 5-80 words; documents range from 100 to 4,000 words
- Language: English only
- Labels available: 25,000 historical Q&A pairs with agent-approved answers; 8,000 manually reviewed examples labeled as grounded, partially grounded, or hallucinated (an illustrative record format follows this list)
- Common failure modes: outdated version references, fabricated feature availability, incorrect pricing/policy details
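For concreteness, the 8,000 reviewed examples could be stored one JSONL record per judgment. The brief does not fix a storage format, so the field names and values below are purely illustrative assumptions:

```python
# Illustrative JSONL record for a reviewed example. Field names (query,
# answer, citations, label) and the example content are assumptions,
# not part of the brief.
import json

example = {
    "query": "Does the Pro plan include SSO?",        # hypothetical user question
    "answer": "Yes, SSO is available on the Pro plan.",
    "citations": ["kb-4121#w220"],                     # ids of supporting passages
    "label": "partially_grounded",  # grounded | partially_grounded | hallucinated
}
print(json.dumps(example))
```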
Success Criteria
A good solution should reduce hallucinated responses by at least 40% relative to the current baseline, achieve ≥0.85 F1 on hallucination detection, and keep end-to-end response latency under 1.5 seconds at p95.
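These targets can be checked mechanically during offline evaluation. A minimal sketch, assuming scikit-learn and NumPy, and assuming F1 is computed with "hallucinated" as the positive class (the brief does not say whether binary or macro F1 is intended):

```python
# Hypothetical check against the stated targets: F1 on hallucination
# detection and p95 end-to-end latency. Assumes scikit-learn and NumPy.
import numpy as np
from sklearn.metrics import f1_score

def meets_targets(y_true, y_pred, latencies_s,
                  f1_target=0.85, p95_target=1.5):
    """y_true/y_pred hold labels like 'grounded', 'partially_grounded',
    'hallucinated'; latencies_s holds per-request latencies in seconds.
    F1 treats 'hallucinated' as the positive class (an assumption)."""
    f1 = f1_score(
        [y == "hallucinated" for y in y_true],
        [y == "hallucinated" for y in y_pred],
    )
    p95 = float(np.percentile(latencies_s, 95))
    return f1 >= f1_target and p95 <= p95_target, {"f1": f1, "p95_s": p95}
```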
Constraints
- Responses must cite source passages from approved documents
- No fine-tuning on proprietary data outside AcmeCloud infrastructure
- System must degrade safely: abstain or escalate when evidence is weak (a routing sketch follows this list)
- Weekly document refreshes require reproducible indexing and evaluation
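The safe-degradation constraint reduces to a routing decision over the verifier's output. A minimal sketch; the thresholds, field names, and routing labels are illustrative assumptions rather than requirements:

```python
# Sketch of the "degrade safely" constraint: abstain or escalate when
# retrieval or verification confidence is weak. Verdict fields and all
# thresholds are assumptions, not prescribed by the brief.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str               # "grounded" | "partially_grounded" | "hallucinated"
    confidence: float        # verifier probability for the predicted label
    retrieval_score: float   # max similarity of the retrieved evidence

def route_answer(verdict: Verdict,
                 min_retrieval: float = 0.35,
                 min_confidence: float = 0.7) -> str:
    if verdict.retrieval_score < min_retrieval:
        return "abstain"            # no usable evidence: refuse, offer escalation
    if verdict.label == "hallucinated":
        return "escalate"           # hand off to a human agent
    if verdict.label == "partially_grounded" or verdict.confidence < min_confidence:
        return "hedge_and_cite"     # answer with caveats and citations only
    return "answer_with_citations"
```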
Requirements
- Design an NLP pipeline to reduce hallucinations in a retrieval-augmented generation system
- Add a verification layer that classifies generated answers as grounded, partially grounded, or hallucinated (an NLI-based sketch follows this list)
- Describe preprocessing for documents, queries, and citations (see the chunking sketch below)
- Implement a modern Python solution using transformers and a realistic retrieval pipeline (see the dense-retrieval sketch below)
- Define offline and online evaluation, including error analysis and safe fallback behavior
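To make the preprocessing requirement concrete, one plausible approach is overlapping word-window chunks with stable citation ids so answers can point back to approved passages. The window and overlap sizes below are assumptions tuned for the 100-4,000-word documents, not requirements:

```python
# Illustrative preprocessing: split a document into overlapping word-window
# chunks and attach stable citation ids. Window/overlap sizes are assumptions.
def chunk_document(doc_id: str, text: str, window: int = 220, overlap: int = 40):
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        body = " ".join(words[start:start + window])
        # e.g. "kb-4121#w220" identifies the passage starting at word 220
        chunks.append({"citation_id": f"{doc_id}#w{start}", "text": body})
    return chunks
```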
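For the retrieval requirement, a minimal dense-retrieval sketch using sentence-transformers and FAISS; the model name and index type are illustrative choices, and the exact-search index would likely be swapped for an approximate one at this corpus size:

```python
# Minimal dense retrieval over document chunks. Assumes sentence-transformers
# and faiss-cpu are installed; model choice is an assumption.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    emb = model.encode(chunks, normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on normalized vectors
    index.add(emb)
    return index

def retrieve(index, chunks: list[str], query: str, k: int = 5):
    q = model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
    scores, ids = index.search(q, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```

Rebuilding this index from scratch on each weekly document refresh, with a fixed model version and seed-free exact search, keeps indexing reproducible as the constraints require.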
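For the verification layer, one common approach (not mandated by the brief) is natural language inference: treat each retrieved passage as the premise and the generated answer as the hypothesis, then map the best entailment probability to the three labels. The model choice and thresholds below are assumptions:

```python
# NLI-based groundedness verifier. Model and thresholds are assumptions;
# the brief only requires the three-way label output.
from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def classify_groundedness(answer: str, passages: list[str]) -> str:
    best_entail = 0.0
    for passage in passages:
        # premise = retrieved evidence, hypothesis = generated answer
        scores = nli({"text": passage, "text_pair": answer}, top_k=None)
        probs = {s["label"].lower(): s["score"] for s in scores}
        best_entail = max(best_entail, probs.get("entailment", 0.0))
    if best_entail >= 0.8:
        return "grounded"
    if best_entail >= 0.4:
        return "partially_grounded"
    return "hallucinated"
```

In practice the answer would be split into sentences and verified per sentence, so partially grounded answers can be localized to the unsupported claim; the sketch scores the whole answer for brevity. Its predictions can be calibrated against the 8,000 reviewed examples before the thresholds are fixed.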