You’re building an internal RAG-based assistant for LedgerLine, a fintech that processes $40B/year in card volume and supports 12,000 enterprise merchants. Customer Support and Risk Ops use the assistant to answer questions about chargeback rules, payout schedules, and compliance policies. A wrong answer can cause regulatory violations (e.g., misstating KYC requirements) or direct financial loss (e.g., quoting an incorrect dispute window). The system must cite sources from an internal policy corpus and should refuse when it cannot ground an answer.
The current RAG pipeline (dense retrieval → top-k chunks → LLM answer) hallucinates frequently: it emits plausible-sounding claims unsupported by the retrieved context, produces incorrect citations, and answers confidently even when evidence is missing or contradictory.
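For reference, a minimal sketch of that baseline loop. The names here (`Chunk`, `retrieve`, `complete`) are hypothetical stand-ins for the dense retriever and model client, not LedgerLine's actual components:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Chunk:
    chunk_id: str
    text: str


def answer(
    query: str,
    retrieve: Callable[[str, int], Sequence[Chunk]],  # dense-retriever stand-in
    complete: Callable[[str], str],                   # LLM-client stand-in
    k: int = 8,
) -> str:
    # Baseline: retrieve top-k chunks, stuff them into one prompt, generate once.
    chunks = retrieve(query, k)
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    prompt = (
        "Answer using ONLY the context below and cite chunk IDs in brackets. "
        "If the context is insufficient, say you cannot answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # No verification step: nothing checks the output against the context
    # before it reaches the user, which is how unsupported claims and bad
    # citations slip through.
    return complete(prompt)
```

The single generate-and-return step is the weak point this exercise targets: the model is asked to self-restrict to the context, but nothing enforces it.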
You are given historical logs and labels from a 6-week pilot:
| Component | Scale | What it contains | Notes |
|---|---|---|---|
| Policy corpus | 2.3M chunks | Chunked PDFs, HTML policies, runbooks | Avg 220 tokens/chunk; 14% near-duplicates |
| Queries | 480K | User questions + metadata (team, locale, product) | Long-tail; 35% are multi-hop |
| RAG traces | 480K | Retrieved chunk IDs, BM25 + embedding scores, prompt, model output, citations | k=8 retrieved chunks |
| Labels | 60K | Human review: grounded vs hallucinated; citation correctness; refusal appropriateness | Stratified sample; 18% hallucinated |
| Weak signals | 480K | User thumbs-down, follow-up “where is that stated?”, escalation to legal | Noisy but high coverage |
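To make the trace layout concrete, here is one plausible record type combining the trace fields and weak signals from the table. Every field name is an illustrative assumption, not the actual log schema:

```python
from dataclasses import dataclass


@dataclass
class RagTrace:
    # Query + metadata (team, locale, product)
    query_id: str
    team: str
    locale: str
    product: str
    # Retrieval: k=8 chunks per the table above, with both score types
    retrieved_chunk_ids: list[str]
    bm25_scores: list[float]
    embedding_scores: list[float]
    # Generation
    prompt: str
    model_output: str
    cited_chunk_ids: list[str]
    # Weak signals: noisy but available for all 480K queries
    thumbs_down: bool = False
    asked_where_stated: bool = False   # follow-up “where is that stated?”
    escalated_to_legal: bool = False
```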
Label schema (per answer):
- `hallucinated` (binary): any factual claim not supported by the provided context
- `citation_correct` (binary): citations point to text that supports the claim
- `should_refuse` (binary): the query cannot be answered from the corpus, or is ambiguous
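A sketch of how those labels might be represented and summarized; `AnswerLabels` and `pilot_rates` are hypothetical names, and the helper simply recovers headline numbers such as the 18% hallucination rate over the 60K stratified sample:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerLabels:
    hallucinated: bool       # any claim unsupported by the provided context
    citation_correct: bool   # citations point to text supporting the claim
    should_refuse: bool      # unanswerable from the corpus, or ambiguous


def pilot_rates(labels: list[AnswerLabels]) -> dict[str, float]:
    # Headline rates over the labeled sample; on the pilot data the
    # hallucination rate should come out near the reported 18%.
    n = len(labels)
    return {
        "hallucination_rate": sum(l.hallucinated for l in labels) / n,
        "citation_error_rate": sum(not l.citation_correct for l in labels) / n,
        "refusal_warranted_rate": sum(l.should_refuse for l in labels) / n,
    }
```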