You’re building an internal RAG-based assistant for LedgerLine, a fintech that processes $40B/year in card volume and supports 12,000 enterprise merchants. Customer Support and Risk Ops use the assistant to answer questions about chargeback rules, payout schedules, and compliance policies. A wrong answer can cause regulatory violations (e.g., misstating KYC requirements) or direct financial loss (e.g., quoting an incorrect dispute window). The system must cite sources from an internal policy corpus and should refuse when it cannot ground an answer.
The current RAG pipeline (dense retrieval → top-k chunks → LLM answer) hallucinates frequently: it emits plausible-sounding claims unsupported by the retrieved context, produces incorrect citations, and answers confidently even when evidence is missing or contradictory.
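For reference, a minimal sketch of that baseline loop. The names here (`Chunk`, `retrieve`, `complete`) are hypothetical stand-ins for the dense retriever and model client, not LedgerLine's actual components:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Chunk:
    chunk_id: str
    text: str


def answer(
    query: str,
    retrieve: Callable[[str, int], Sequence[Chunk]],  # dense-retriever stand-in
    complete: Callable[[str], str],                   # LLM-client stand-in
    k: int = 8,
) -> str:
    # Baseline: retrieve top-k chunks, stuff them into one prompt, generate once.
    chunks = retrieve(query, k)
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    prompt = (
        "Answer using ONLY the context below and cite chunk IDs in brackets. "
        "If the context is insufficient, say you cannot answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # No verification step: nothing checks the output against the context
    # before it reaches the user, which is how unsupported claims and bad
    # citations slip through.
    return complete(prompt)
```

The single generate-and-return step is the weak point this exercise targets: the model is asked to self-restrict to the context, but nothing enforces it.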
You are given historical logs and labels from a 6-week pilot:
| Component | Scale | What it contains | Notes |
|---|---|---|---|
| Policy corpus | 2.3M chunks | Chunked PDFs, HTML policies, runbooks | Avg 220 tokens/chunk; 14% near-duplicates |
| Queries | 480K | User questions + metadata (team, locale, product) | Long-tail; 35% are multi-hop |
| RAG traces | 480K | Retrieved chunk IDs, BM25 + embedding scores, prompt, model output, citations | k=8 retrieved chunks |
| Labels | 60K | Human review: grounded vs hallucinated; citation correctness; refusal appropriateness | Stratified sample; 18% hallucinated |
| Weak signals | 480K | User thumbs-down, follow-up “where is that stated?”, escalation to legal | Noisy but high coverage |
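To make the trace layout concrete, here is one plausible record type combining the trace fields and weak signals from the table. Every field name is an illustrative assumption, not the actual log schema:

```python
from dataclasses import dataclass


@dataclass
class RagTrace:
    # Query + metadata (team, locale, product)
    query_id: str
    team: str
    locale: str
    product: str
    # Retrieval: k=8 chunks per the table above, with both score types
    retrieved_chunk_ids: list[str]
    bm25_scores: list[float]
    embedding_scores: list[float]
    # Generation
    prompt: str
    model_output: str
    cited_chunk_ids: list[str]
    # Weak signals: noisy but available for all 480K queries
    thumbs_down: bool = False
    asked_where_stated: bool = False   # follow-up “where is that stated?”
    escalated_to_legal: bool = False
```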
Label schema (per answer):
- `hallucinated` (binary): any factual claim not supported by the provided context
- `citation_correct` (binary): citations point to text that supports the claim
- `should_refuse` (binary): the query cannot be answered from the corpus, or is ambiguous
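A sketch of how those labels might be represented and summarized; `AnswerLabels` and `pilot_rates` are hypothetical names, and the helper simply recovers headline numbers such as the 18% hallucination rate over the 60K stratified sample:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerLabels:
    hallucinated: bool       # any claim unsupported by the provided context
    citation_correct: bool   # citations point to text supporting the claim
    should_refuse: bool      # unanswerable from the corpus, or ambiguous


def pilot_rates(labels: list[AnswerLabels]) -> dict[str, float]:
    # Headline rates over the labeled sample; on the pilot data the
    # hallucination rate should come out near the reported 18%.
    n = len(labels)
    return {
        "hallucination_rate": sum(l.hallucinated for l in labels) / n,
        "citation_error_rate": sum(not l.citation_correct for l in labels) / n,
        "refusal_warranted_rate": sum(l.should_refuse for l in labels) / n,
    }
```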