Choose Fine-Tuning vs RAG

Scenario

You are improving a finance-domain assistant that helps operations teams draft responses, summarize account notes, and answer questions about policies and workflows. The current version uses retrieval over internal SOPs, underwriting guidelines, and historical support resolutions, but users still report inconsistent tone, weak handling of repetitive classification-style tasks, and occasional unsupported answers. Traffic is expected to reach 20K requests per day, with a mix of knowledge-grounded questions and high-volume templated tasks. You need to decide whether to keep investing in retrieval, fine-tune a model, or use a hybrid approach.

Constraints

p95 latency must stay under 1,500 ms
Inference cost must stay under $12K/month at target volume
Unsupported factual claims must remain below 2% on a golden set
The system must resist prompt injection from retrieved content and user input
Responses involving customer or financial data must avoid leaking sensitive information
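
The cost and accuracy constraints can be turned into per-request numbers before comparing architectures. The following is a minimal sketch, not a definitive method: it assumes a 30-day month and flat traffic at the target volume, and the function names (`per_request_budget`, `unsupported_rate`) and the boolean golden-set labels are hypothetical, introduced only for illustration.

```python
# Sketch: translate the stated constraints into per-request numbers.
# Assumptions (not from the scenario): 30-day month, flat daily traffic.

REQUESTS_PER_DAY = 20_000       # target volume from the scenario
MONTHLY_BUDGET_USD = 12_000     # inference cost ceiling from the constraints
MAX_UNSUPPORTED_RATE = 0.02     # "below 2% on a golden set"


def per_request_budget(monthly_budget_usd, requests_per_day, days_per_month=30):
    """Average inference spend allowed per request, in USD."""
    return monthly_budget_usd / (requests_per_day * days_per_month)


def unsupported_rate(has_unsupported_claim):
    """Fraction of golden-set answers flagged as containing an unsupported claim.

    has_unsupported_claim: list of booleans, one per golden-set example
    (hypothetical labels produced during the expert labeling week).
    """
    if not has_unsupported_claim:
        raise ValueError("golden set is empty")
    return sum(has_unsupported_claim) / len(has_unsupported_claim)


if __name__ == "__main__":
    budget = per_request_budget(MONTHLY_BUDGET_USD, REQUESTS_PER_DAY)
    print(f"Average budget per request: ${budget:.4f}")

    # Toy labels: 1 flagged answer out of 100 graded examples.
    labels = [True] + [False] * 99
    rate = unsupported_rate(labels)
    print(f"Unsupported rate: {rate:.2%}, passes: {rate < MAX_UNSUPPORTED_RATE}")
```

Under these assumptions the budget works out to about $0.02 per request on average, which is the figure any candidate design (retrieval with the large model, a fine-tuned small model, or a hybrid that routes between them) would need to fit within.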

Available Resources

Internal policy documents, SOPs, and resolved case histories
8,000 labeled examples from prior analyst workflows and customer communications
Access to an approved GPT-4-class model, a cheaper small model, and an embeddings API
A vector store with hybrid retrieval support
Budget for one week of expert labeling to build evaluation sets

Question

How would you decide when fine-tuning is the right choice instead of retrieval-augmented generation for this assistant, and what system would you ship first given these constraints?
