Business Context
Northstar Health wants a production system that answers clinician questions over internal policies, care pathways, discharge instructions, and trial protocols. You must design candidate architectures and determine whether a fine-tuned LLM, a retrieval-augmented generation (RAG) pipeline, or a hybrid of the two is the best fit for internal clinical document querying.
Data
- Corpus: 420,000 internal clinical documents across PDF, DOCX, HTML, and scanned OCR text
- Text length: 1 paragraph to 180 pages; median chunkable length is 320 words
- Language: English only, but includes abbreviations, ICD/CPT codes, drug names, and templated sections
- Question set: 18,000 historical clinician queries with 6,500 answerable QA pairs and document citations
- Label distribution: ~55% fact lookup, 25% policy/procedure questions, 12% multi-document synthesis, 8% unanswerable or outdated
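Given the 320-word median chunkable length above, a word-window chunker is a natural starting point. The sketch below is illustrative only: the 320-word window mirrors the corpus median, while the 40-word overlap is an assumption, not a stated requirement.

```python
from typing import Iterator


def chunk_words(text: str, max_words: int = 320, overlap: int = 40) -> Iterator[str]:
    """Split text into overlapping word windows.

    max_words=320 matches the corpus's median chunkable length;
    overlap=40 is an illustrative assumption for preserving context
    across chunk boundaries.
    """
    words = text.split()
    if not words:
        return
    step = max_words - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_words])
        # Stop once the window has reached the end of the document.
        if start + max_words >= len(words):
            break
```

Real ingestion would chunk per-format (PDF, DOCX, HTML, OCR text) and respect templated section boundaries rather than raw word counts.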
Success Criteria
The system is good enough if it achieves:
- Grounded answer quality with citation support
- Answer faithfulness above 90% on answerable questions
- Top-5 retrieval recall above 92% for RAG-style systems
- Median end-to-end latency below 2 seconds for interactive use
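The top-5 retrieval recall target can be measured directly against the 6,500 QA pairs with document citations. A minimal sketch, assuming retrieved and gold results are expressed as chunk IDs (the ID scheme is hypothetical):

```python
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved chunk IDs include
    at least one gold (clinician-cited) chunk ID.

    retrieved: per-query ranked lists of chunk IDs.
    gold: per-query sets of cited chunk IDs.
    """
    hits = sum(1 for ranked, cited in zip(retrieved, gold) if cited & set(ranked[:k]))
    return hits / len(gold)
```

Faithfulness is harder to score automatically; it typically needs either human grading or an entailment-style check of each answer sentence against its cited passages.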
Constraints
- HIPAA-compliant deployment in a private VPC; no external API calls
- Documents update daily, with versioned policies and retired content
- Answers must cite source passages and avoid unsupported medical claims
- Budget supports one 24GB GPU for training/inference plus a managed vector store
Requirements
- Propose a fine-tuned LLM approach and a RAG pipeline for this corpus.
- Compare trade-offs in freshness, hallucination risk, maintenance cost, latency, and citation quality.
- Build a realistic preprocessing pipeline for ingestion, chunking, metadata extraction, and de-identification.
- Implement a baseline modern Python solution for document indexing, retrieval, generation, and evaluation.
- Define an offline evaluation plan and recommend which architecture you would ship first, with justification.
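As a floor for the baseline required above, even a lexical retriever over the chunked corpus gives something to evaluate before any embeddings or generation are wired in. The sketch below is a tiny in-memory TF-IDF index standing in for the managed vector store; the class name, scoring details, and example documents are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfIdfIndex:
    """Toy lexical index: TF-IDF weighting with document-length
    normalization. A stand-in for the managed vector store, useful
    as an evaluation floor for recall@5."""

    def __init__(self, docs: dict[str, str]):
        self.tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.df = Counter()
        for tf in self.tfs.values():
            self.df.update(tf.keys())
        self.n = len(docs)

    def idf(self, term: str) -> float:
        # Smoothed inverse document frequency.
        return math.log((1 + self.n) / (1 + self.df[term])) + 1.0

    def search(self, query: str, k: int = 5) -> list[str]:
        q = Counter(tokenize(query))
        scores = {}
        for doc_id, tf in self.tfs.items():
            dot = sum(q[t] * tf[t] * self.idf(t) ** 2 for t in q)
            norm = math.sqrt(sum((c * self.idf(t)) ** 2 for t, c in tf.items())) or 1.0
            scores[doc_id] = dot / norm
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

A production version would add ICD/CPT-code-aware tokenization, metadata filters for document versioning, and a dense retriever; this baseline only establishes the evaluation harness.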