Spring Health wants to deploy an internal clinical copilot that assists care navigators and clinicians by answering questions from intake notes, prior assessments, care plans, and provider documentation. Because the output may influence mental health care decisions, the system must apply LLMs safely: managing long context, structuring prompts effectively, and minimizing hallucinations.
You are given approximately 1.8M de-identified clinical documents from Spring Health workflows, including intake questionnaires, therapist notes, care plans, and referral summaries. Documents range from 50 to 6,000 tokens (median: 780), are mostly English, and contain domain-specific terminology, abbreviations, medication mentions, diagnoses, symptom descriptions, and risk language. A smaller evaluation set of 12,000 clinician-authored Q&A pairs includes labels for whether the answer is fully supported, partially supported, or unsupported by source documents.
A good solution should achieve high answer supportability: at least 90% precision on answers emitted as fully supported, under 2% unsupported answers on high-risk prompts, and p95 latency under 2.5 seconds for interactive use in Spring Health’s internal clinician surface.
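To make the acceptance criteria concrete, the three metrics above can be sketched as a small evaluation harness. This is a minimal illustration, not Spring Health's actual pipeline: the record schema, label strings, and the nearest-rank p95 method are all assumptions for the sake of the example.

```python
# Hypothetical evaluation sketch. Field names and label values are assumptions,
# not part of any actual Spring Health schema.
from dataclasses import dataclass


@dataclass
class EvalRecord:
    predicted_label: str  # system's claim: "fully_supported" | "partially_supported" | "unsupported"
    gold_label: str       # clinician label from the 12,000-pair evaluation set
    high_risk: bool       # whether the prompt involves risk language
    latency_s: float      # end-to-end answer latency in seconds


def evaluate(records: list[EvalRecord]) -> dict[str, float]:
    # Precision on fully supported answers: of the answers the system
    # emits as fully supported, what fraction did clinicians agree with?
    emitted = [r for r in records if r.predicted_label == "fully_supported"]
    precision = (
        sum(r.gold_label == "fully_supported" for r in emitted) / len(emitted)
        if emitted else 0.0
    )

    # Unsupported-answer rate restricted to high-risk prompts (target: <2%).
    high_risk = [r for r in records if r.high_risk]
    unsupported_rate = (
        sum(r.gold_label == "unsupported" for r in high_risk) / len(high_risk)
        if high_risk else 0.0
    )

    # p95 latency via the nearest-rank method on sorted latencies (target: <2.5 s).
    lat = sorted(r.latency_s for r in records)
    p95 = lat[min(len(lat) - 1, int(0.95 * len(lat)))]

    return {
        "fully_supported_precision": precision,   # target: >= 0.90
        "high_risk_unsupported_rate": unsupported_rate,  # target: < 0.02
        "p95_latency_s": p95,                     # target: < 2.5
    }


# Tiny synthetic check with four fabricated records.
recs = [
    EvalRecord("fully_supported", "fully_supported", False, 1.1),
    EvalRecord("fully_supported", "partially_supported", True, 2.0),
    EvalRecord("unsupported", "unsupported", True, 0.9),
    EvalRecord("partially_supported", "partially_supported", False, 1.4),
]
print(evaluate(recs))
```

In practice the precision gate means the system should prefer abstaining or downgrading to "partially supported" when evidence is thin, since every over-claimed "fully supported" answer counts directly against the 90% bar.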