Context
NorthStar Health operates 120 outpatient clinics and stores historical patient encounter, medication, lab, and billing data in a legacy on-prem EMR backed by SQL Server, HL7 message feeds, and nightly CSV exports. The analytics team is migrating to an AWS lakehouse with Snowflake, but legacy schemas, inconsistent coding standards, and unreliable batch extracts make downstream reporting and cohort analysis difficult.
You are asked to design a pipeline that integrates legacy EMR data into the modern platform while preserving auditability and supporting both daily batch analytics and near-real-time operational dashboards.
Scale Requirements
- Source systems: 1 primary EMR database, 6 ancillary systems, HL7 ADT/ORM/ORU feeds
- Volume: 180M historical encounters, 2.5B lab result rows, 25 TB raw backfill
- Incremental load: 8M changed rows/day, 150 GB/day new data
- Streaming: up to 1,500 HL7 messages/sec peak
- Latency target: batch data available within 2 hours; HL7-derived operational tables within 5 minutes
- Retention: 7 years online, immutable audit history for all corrections
Requirements
- Ingest historical EMR tables and ongoing CDC changes from SQL Server into AWS storage without losing source lineage.
- Parse HL7 feeds and normalize them into canonical patient, encounter, provider, diagnosis, procedure, medication, and lab models.
- Handle common legacy issues: duplicate patient identifiers, late corrections, code-set drift (ICD-9/10, CPT, local lab codes), null-heavy columns, and inconsistent timestamps/time zones.
- Support idempotent reprocessing, selective backfills by facility/date range, and schema evolution with minimal downstream breakage.
- Load curated data into Snowflake for BI and regulatory reporting, with data quality checks and reconciliation against source extracts.
- Provide monitoring, alerting, and failure recovery for both batch and streaming paths.
Constraints
- AWS is the required cloud; existing team already uses Airflow, dbt, and Snowflake.
- PHI must remain encrypted in transit and at rest; HIPAA audit logging is mandatory.
- Budget limits favor managed services over large always-on clusters.
- Source EMR allows only limited read replicas and a 4-hour nightly extraction window.