Business Context
You are joining HelioTrials, a healthcare SaaS vendor that supports top-20 pharma sponsors running Phase II/III studies. HelioTrials hosts EDC/SDTM/ADaM exports and produces draft Clinical Study Reports (CSRs) for medical writers. The platform processes ~300 studies/year, each with 50–500 sites and 1K–20K subjects. A single incorrect number in a CSR (e.g., AE counts, disposition totals, p-values) can trigger regulatory findings, delay submissions by weeks, and create legal exposure.
The product goal is to automatically generate a CSR Results section (tables-to-text narrative) from raw trial datasets and TLF outputs, while ensuring that every reported number matches its source and is auditable.
Data Characteristics
- Inputs:
  - Structured trial data: SDTM/ADaM (SAS XPT/Parquet) for demographics, disposition, efficacy endpoints, labs, vitals, AEs, concomitant meds.
  - Precomputed TLF tables (CSV/JSON) for key outputs (e.g., Table 14.3.1 AE summary, Table 11.2.1 disposition).
  - Protocol + SAP snippets (PDF-to-text) describing estimands, populations (ITT/SAF/PP), and statistical methods.
- Text outputs: CSR narrative paragraphs, typically 150–600 words per subsection, with heavy domain vocabulary (MedDRA, SOC/PT, LS mean, CI, MMRM, Kaplan–Meier).
- Numeric complexity: counts, percentages, denominators, stratified subgroups, rounding rules, multiplicity adjustments, missingness handling.
- Labeling reality: you have ~1,200 historical CSR paragraphs paired with their source tables and metadata, but they contain occasional human errors and inconsistent phrasing.
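To make the "rounding rules" and "denominators" items concrete, here is a minimal sketch of rendering a count and percentage in the common "n (x.x%)" style. The function name format_n_pct, the one-decimal default, and the half-up rounding rule are all assumptions for illustration; the actual rule would come from the SAP or the TLF shell for each study.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_n_pct(count: int, denominator: int, decimals: int = 1) -> str:
    """Render 'n (x.x%)' with explicit half-up rounding (an assumed rule)."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    if not 0 <= count <= denominator:
        raise ValueError("count must lie between 0 and the denominator")
    # Decimal(1).scaleb(-1) is Decimal('0.1'), i.e. one decimal place.
    quantum = Decimal(1).scaleb(-decimals)
    pct = (Decimal(count) * 100 / Decimal(denominator)).quantize(
        quantum, rounding=ROUND_HALF_UP
    )
    return f"{count} ({pct}%)"


print(format_n_pct(37, 412))  # 37 (9.0%)
print(format_n_pct(1, 16))    # 1 (6.3%)
```

Using Decimal rather than float keeps the rounding rule explicit: 1/16 is exactly 6.25%, which half-up rounding renders as 6.3, whereas Python's built-in round(6.25, 1) returns 6.2.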
Success Criteria
- Numeric fidelity: ≥ 99.9% of extracted numeric claims in generated text exactly match source values (after applying defined rounding/formatting rules).
- Auditability: every numeric claim must include a machine-readable citation to (table_id, row/col coordinates, population, analysis set, timestamped dataset version).
- Writer usefulness: medical writers accept ≥ 60% of generated paragraphs with no more than minor edits (measured in the authoring tool).
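One possible shape for the machine-readable citation and the numeric-fidelity metric is sketched below. The class names Citation and NumericClaim, the field names, and the choice to compare values as decimals (so "9.0" and "9" count as equal) are assumptions, not part of the brief; a stricter deployment might compare the formatted strings instead.

```python
from dataclasses import dataclass
from decimal import Decimal
from fractions import Fraction


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record mirroring the fields listed above."""
    table_id: str
    row: str
    col: str
    population: str
    analysis_set: str
    dataset_version: str  # e.g. an ISO-8601 timestamp of the dataset export


@dataclass(frozen=True)
class NumericClaim:
    text_value: str    # bare numeral as it appears in the generated narrative
    source_value: str  # value at the cited cell, after rounding/formatting rules
    citation: Citation


def numeric_fidelity(claims: list[NumericClaim]) -> Fraction:
    """Share of claims whose narrative value equals the cited source value."""
    if not claims:
        raise ValueError("no numeric claims to score")
    matched = sum(
        1 for c in claims if Decimal(c.text_value) == Decimal(c.source_value)
    )
    return Fraction(matched, len(claims))


# The 99.9% target, kept as an exact fraction to avoid float comparisons.
TARGET = Fraction(999, 1000)
```

Returning a Fraction keeps the threshold check exact, for example numeric_fidelity(claims) >= TARGET.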
Constraints
- Regulatory: 21 CFR Part 11-style audit trails; immutable artifacts; reproducible generation.
- Privacy: data stays in a VPC; no external API calls; PHI/PII must be removed from prompts.
- Latency: interactive drafting in < 10 seconds per subsection.
- Determinism: reruns with same inputs must be stable (or differences must be explainable and logged).
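For the reproducibility and determinism constraints, one common approach is to log a fingerprint of everything that influences a generation run, so that two runs with the same fingerprint are expected to produce the same draft. The sketch below is illustrative only: the function name generation_fingerprint and the input keys (dataset_version, table_ids, template_id, model_id, seed) are assumed, not prescribed by the brief.

```python
import hashlib
import json


def generation_fingerprint(inputs: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the run inputs.

    Sorting keys and fixing separators makes the hash independent of
    dict insertion order, so identical inputs always hash identically.
    """
    canonical = json.dumps(
        inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run inputs; all keys and values are placeholders.
run_inputs = {
    "dataset_version": "2024-01-01T00:00:00Z",
    "table_ids": ["T11.2.1", "T14.3.1"],
    "template_id": "ae_summary_v1",
    "model_id": "local-model-a",
    "seed": 0,
}
print(generation_fingerprint(run_inputs))
```

Storing this fingerprint alongside the immutable output artifact gives an auditor a simple check: if a rerun differs, either the fingerprint changed (an explainable difference) or the pipeline has a nondeterminism bug.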
Requirements (Deliverables)
- Propose an end-to-end architecture to generate CSR narrative from structured data (tables + metadata + SAP).
- Design a number-grounding strategy so the model cannot “invent” values (e.g., constrained decoding, slot filling, tool calling).
- Implement an information extraction layer that detects numeric claims in generated text and links them back to sources.
- Define a validation and gating pipeline that blocks outputs with mismatched numbers and provides actionable error messages.
- Provide an evaluation plan (offline + human-in-the-loop) and monitoring plan for production drift.
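As a starting point for the number-grounding, extraction, and gating deliverables, the sketch below shows one slot-filling approach: the model writes prose containing slots rather than digits, a renderer fills each slot from a source cell, and a gate rejects any numeral not covered by a filled slot. The slot syntax {{table|row|col}}, the names render and gate, and the sample table values are all hypothetical. The simple number regex would also flag legitimate numerals such as table identifiers or visit numbers, so a real gate would need an allow-list.

```python
import re
from dataclasses import dataclass

# Hypothetical slot syntax: {{table_id|row|col}}
SLOT = re.compile(r"\{\{([^|{}]+)\|([^|{}]+)\|([^|{}]+)\}\}")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass(frozen=True)
class GroundedSpan:
    start: int
    end: int
    value: str
    table_id: str
    row: str
    col: str


def render(template: str, cells: dict) -> tuple[str, list[GroundedSpan]]:
    """Fill slots from source cells and record where each value landed."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for m in SLOT.finditer(template):
        key = (m.group(1), m.group(2), m.group(3))
        if key not in cells:
            raise KeyError(f"no source cell for slot {key}")
        literal = template[cursor:m.start()]
        value = cells[key]
        parts += [literal, value]
        out_len += len(literal)
        spans.append(GroundedSpan(out_len, out_len + len(value), value, *key))
        out_len += len(value)
        cursor = m.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def gate(text: str, spans: list[GroundedSpan]) -> list[str]:
    """Return actionable errors; an empty list means the text may pass."""
    errors = []
    for m in NUMBER.finditer(text):
        if not any(s.start <= m.start() and m.end() <= s.end for s in spans):
            errors.append(f"ungrounded number '{m.group()}' at offset {m.start()}")
    for s in spans:
        if text[s.start:s.end] != s.value:
            errors.append(
                f"value at {s.table_id}/{s.row}/{s.col} should be '{s.value}' "
                f"but text has '{text[s.start:s.end]}'"
            )
    return errors


# Made-up cell value for illustration only.
cells = {("T14.3.1", "Any TEAE", "Drug A"): "37 (9.0%)"}
template = "TEAEs were reported in {{T14.3.1|Any TEAE|Drug A}} of subjects on Drug A."
text, spans = render(template, cells)
print(text)
print(gate(text, spans))
```

The design choice here is that the generator never emits digits directly, which makes invented values structurally impossible rather than merely unlikely; the gate then serves as a second, independent check on the rendered text.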