Business Context
You’re joining AstraScribe, a healthcare documentation vendor integrated into Epic and Cerner. The product generates patient-facing after-visit summaries (AVS) and clinician discharge instructions from encounter notes. AstraScribe serves 120 hospital systems and produces ~2.5M summaries/month. A recent internal audit found that 0.4% of generated summaries contain at least one unsupported clinical claim (e.g., adding a diagnosis not in the chart, inventing a medication change). Even a single hallucinated instruction can cause patient harm and create regulatory and legal exposure (HIPAA, FDA SaMD risk management, malpractice).
Your task is to design an NLP system that prevents hallucinations in high-stakes medical writing while keeping the summaries readable and fast to generate.
Data Characteristics
AstraScribe has a de-identified dataset of 180k encounters with:
- Inputs:
  - Encounter note sections: HPI, ROS, Assessment/Plan, Medications, Allergies, Labs, Imaging, Discharge orders
  - Structured EHR fields: active problem list, medication list, vitals, ICD-10 codes, lab tables
- Outputs:
  - Human-written AVS (gold) for ~60k encounters
  - Model-generated AVS + clinician edits for ~120k encounters
- Text length:
  - Notes: 400–6,000 tokens (median ~1,800)
  - AVS: 120–600 tokens (median ~260)
- Language: English only, but heavy clinical shorthand (e.g., “SOB”, “AFib”, “Cr 1.8”, “d/c”, “PRN”).
- Error labels (from audit + clinician review on 15k samples):
  - SUPPORTED (no unsupported claims)
  - UNSUPPORTED_MED_CHANGE
  - UNSUPPORTED_DIAGNOSIS
  - UNSUPPORTED_FOLLOWUP
  - UNSUPPORTED_LAB_INTERPRETATION
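The audit taxonomy above maps naturally onto an explicit label set in code. A minimal sketch in Python (the label names come from the list above; the `UNSUPPORTED` helper set is an illustrative convenience, not part of the spec):

```python
from enum import Enum


class ClaimLabel(str, Enum):
    """Audit labels from the 15k clinician-reviewed samples."""
    SUPPORTED = "SUPPORTED"
    UNSUPPORTED_MED_CHANGE = "UNSUPPORTED_MED_CHANGE"
    UNSUPPORTED_DIAGNOSIS = "UNSUPPORTED_DIAGNOSIS"
    UNSUPPORTED_FOLLOWUP = "UNSUPPORTED_FOLLOWUP"
    UNSUPPORTED_LAB_INTERPRETATION = "UNSUPPORTED_LAB_INTERPRETATION"


# Convenience set: every label that indicates a hallucination.
UNSUPPORTED = {label for label in ClaimLabel if label is not ClaimLabel.SUPPORTED}
```

Keeping the taxonomy as an enum (rather than free-form strings) makes downstream verification and metrics code fail loudly on unknown labels.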
Success Criteria
- Reduce hallucination rate from 0.4% → ≤0.1% on a held-out, clinician-reviewed set.
- Maintain readability: clinician satisfaction ≥ 4.3/5 (survey) and AVS length within ±15% of baseline.
- Latency: p95 < 2.0s per encounter on 1× NVIDIA L4 (batch size 1–4).
- Provide evidence traces: every clinical statement in the output must link to supporting spans from the note/EHR fields, or be explicitly flagged as “not found”.
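The evidence-trace criterion implies a per-statement record linking each output sentence to source spans, or marking it "not found". A minimal sketch of what such a record could look like (all field names and the `source` string convention are illustrative assumptions, not a mandated schema):

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceSpan:
    source: str  # illustrative convention, e.g. "note:assessment_plan" or "ehr:medication_list"
    start: int   # character offset into the de-identified source text
    end: int


@dataclass
class StatementTrace:
    statement: str  # one clinical statement from the generated AVS
    evidence: list[EvidenceSpan] = field(default_factory=list)

    @property
    def status(self) -> str:
        # Per the success criterion: every statement either cites
        # supporting spans or is explicitly flagged as "not found".
        return "supported" if self.evidence else "not found"
```

Storing offsets into the de-identified source (rather than copied text) keeps traces auditable while helping logs stay PHI-free.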
Constraints
- No external internet retrieval at inference time (hospital network restrictions).
- Must be deployable in a HIPAA environment; logs must avoid PHI.
- Model must support deterministic guardrails (hard blocks) for high-risk content (med changes, new diagnoses, dosing instructions).
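The deterministic-guardrail constraint can be made concrete as a rule that never releases an output containing an unsupported high-risk claim. A sketch under the assumptions that claims arrive as dicts with a `type` and a `supported` flag, and that logs record only types and counts (never claim text) to stay PHI-free:

```python
import logging

# High-risk claim types that trigger a hard block when unsupported,
# per the guardrail constraint (type names are illustrative).
HARD_BLOCK_TYPES = {"med_change", "new_diagnosis", "dosing_instruction"}

log = logging.getLogger("avs.guardrail")


def guardrail(claims: list[dict]) -> str:
    """Deterministic decision: 'release', 'edit', or 'block'.

    Each claim dict: {"type": str, "supported": bool}. Only claim
    types and counts are logged, so log output contains no PHI.
    """
    unsupported = [c for c in claims if not c["supported"]]
    high_risk = [c for c in unsupported if c["type"] in HARD_BLOCK_TYPES]
    if high_risk:
        log.warning("hard block: %d unsupported high-risk claim(s)", len(high_risk))
        return "block"    # escalate to a clinician
    if unsupported:
        return "edit"     # low-risk unsupported claims are edited out
    return "release"
```

Because the rule is pure branching over claim labels (no model in the loop), the same input always yields the same decision, which is what "deterministic guardrail" demands.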
Requirements (Deliverables)
- Propose an end-to-end approach combining generation with grounding + verification to prevent hallucinations.
- Define a schema for atomic clinical claims (e.g., medication change, diagnosis, follow-up instruction) and how you will extract them from generated text.
- Implement a baseline pipeline that:
  - Generates a draft summary (LLM)
  - Extracts claims (NER + pattern matching)
  - Verifies each claim against encounter evidence (retrieval over local context + entailment/classifier)
  - Either edits out unsupported claims or blocks the output with a clinician escalation message
- Describe how you would fine-tune or adapt models (e.g., BioClinicalBERT/DeBERTa for verification) given limited labeled hallucination data.
- Provide an evaluation plan with safety-first metrics and targeted error analysis.