Business Context
Northstar Health wants a production system that answers clinician questions over internal policies, care pathways, discharge instructions, and trial protocols. You must design candidate architectures and determine whether a fine-tuned LLM, a retrieval-augmented generation (RAG) pipeline, or a hybrid of the two is the best fit for internal clinical document querying.
Data
- Corpus: 420,000 internal clinical documents across PDF, DOCX, HTML, and scanned OCR text
- Text length: 1 paragraph to 180 pages; median chunkable length is 320 words
- Language: English only, but includes abbreviations, ICD/CPT codes, drug names, and templated sections
- Question set: 18,000 historical clinician queries with 6,500 answerable QA pairs and document citations
- Label distribution: ~55% fact lookup, 25% policy/procedure questions, 12% multi-document synthesis, 8% unanswerable or outdated
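Given the 320-word median chunkable length above, a word-window chunker is a natural starting point. The sketch below is illustrative only: the 320-word window mirrors the corpus median, while the 40-word overlap is an assumption, not a stated requirement.

```python
from typing import Iterator


def chunk_words(text: str, max_words: int = 320, overlap: int = 40) -> Iterator[str]:
    """Split text into overlapping word windows.

    max_words=320 matches the corpus's median chunkable length;
    overlap=40 is an illustrative assumption for preserving context
    across chunk boundaries.
    """
    words = text.split()
    if not words:
        return
    step = max_words - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_words])
        # Stop once the window has reached the end of the document.
        if start + max_words >= len(words):
            break
```

Real ingestion would chunk per-format (PDF, DOCX, HTML, OCR text) and respect templated section boundaries rather than raw word counts.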
Success Criteria
The system is good enough if it achieves:
- Grounded answer quality with citation support
- Answer faithfulness above 90% on answerable questions
- Top-5 retrieval recall above 92% for RAG-style systems
- Median end-to-end latency below 2 seconds for interactive use
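The top-5 retrieval recall target can be measured directly against the 6,500 QA pairs with document citations. A minimal sketch, assuming retrieved and gold results are expressed as chunk IDs (the ID scheme is hypothetical):

```python
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved chunk IDs include
    at least one gold (clinician-cited) chunk ID.

    retrieved: per-query ranked lists of chunk IDs.
    gold: per-query sets of cited chunk IDs.
    """
    hits = sum(1 for ranked, cited in zip(retrieved, gold) if cited & set(ranked[:k]))
    return hits / len(gold)
```

Faithfulness is harder to score automatically; it typically needs either human grading or an entailment-style check of each answer sentence against its cited passages.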
Constraints
- HIPAA-compliant deployment in a private VPC; no external API calls
- Documents update daily, with versioned policies and retired content
- Answers must cite source passages and avoid unsupported medical claims
- Budget supports one 24GB GPU for training/inference plus a managed vector store
Requirements
- Propose a fine-tuned LLM approach and a RAG pipeline for this corpus.
- Compare trade-offs in freshness, hallucination risk, maintenance cost, latency, and citation quality.
- Build a realistic preprocessing pipeline for ingestion, chunking, metadata extraction, and de-identification.
- Implement a baseline modern Python solution for document indexing, retrieval, generation, and evaluation.
- Define an offline evaluation plan and recommend which architecture you would ship first, with justification.
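As a floor for the baseline required above, even a lexical retriever over the chunked corpus gives something to evaluate before any embeddings or generation are wired in. The sketch below is a tiny in-memory TF-IDF index standing in for the managed vector store; the class name, scoring details, and example documents are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfIdfIndex:
    """Toy lexical index: TF-IDF weighting with document-length
    normalization. A stand-in for the managed vector store, useful
    as an evaluation floor for recall@5."""

    def __init__(self, docs: dict[str, str]):
        self.tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.df = Counter()
        for tf in self.tfs.values():
            self.df.update(tf.keys())
        self.n = len(docs)

    def idf(self, term: str) -> float:
        # Smoothed inverse document frequency.
        return math.log((1 + self.n) / (1 + self.df[term])) + 1.0

    def search(self, query: str, k: int = 5) -> list[str]:
        q = Counter(tokenize(query))
        scores = {}
        for doc_id, tf in self.tfs.items():
            dot = sum(q[t] * tf[t] * self.idf(t) ** 2 for t in q)
            norm = math.sqrt(sum((c * self.idf(t)) ** 2 for t, c in tf.items())) or 1.0
            scores[doc_id] = dot / norm
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

A production version would add ICD/CPT-code-aware tokenization, metadata filters for document versioning, and a dense retriever; this baseline only establishes the evaluation harness.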