Context
Akido needs a pipeline to ingest patient records from multiple hospital networks into Akido's clinical data platform. Today, partner hospitals deliver data through a mix of HL7 v2 feeds, FHIR APIs, SFTP batch files, and occasional CSV extracts, which creates inconsistent schemas, duplicate patient identities, and delayed downstream availability for Akido care operations.
Design a secure, auditable pipeline that standardizes these records into a canonical patient model usable by Akido products and internal analytics. Assume Akido already operates on AWS and exposes internal data products through Akido Pipelines and Akido's clinical data layer.
Scale Requirements
- Sources: 120 hospital networks, each with 5-20 upstream systems
- Throughput: 15K messages/sec peak and 3K messages/sec average across HL7 and FHIR events
- Batch volume: 8M patient records/day via files and API pulls
- Payload size: 5-50 KB per record/message
- Latency target: critical ADT updates available in Akido within 2 minutes; batch standardization within 30 minutes of file arrival
- Storage: 25 TB raw/year, 8 TB curated/year
- Retention: 7 years for raw payloads and audit logs
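The figures above can be turned into rough sizing numbers. The following is a back-of-envelope sketch only: it assumes decimal units (1 MB = 1000 KB), treats the payload range as bounds on wire size, and takes the yearly storage figures as given rather than deriving them. All constant and function names are illustrative.

```python
# Back-of-envelope sizing from the Scale Requirements list.
# Assumption: decimal units (1 MB = 1000 KB); names are illustrative only.
PEAK_MSGS_PER_SEC = 15_000
AVG_MSGS_PER_SEC = 3_000
PAYLOAD_KB_MIN, PAYLOAD_KB_MAX = 5, 50
BATCH_RECORDS_PER_DAY = 8_000_000
RAW_TB_PER_YEAR = 25
RETENTION_YEARS = 7
SECONDS_PER_DAY = 86_400


def mb_per_sec(msgs_per_sec: int, payload_kb: int) -> float:
    """Wire throughput in MB/s for a given message rate and payload size."""
    return msgs_per_sec * payload_kb / 1000


# Peak streaming ingress, bounded by the smallest and largest payloads.
peak_mb_s_low = mb_per_sec(PEAK_MSGS_PER_SEC, PAYLOAD_KB_MIN)
peak_mb_s_high = mb_per_sec(PEAK_MSGS_PER_SEC, PAYLOAD_KB_MAX)

# Ratio the autoscaling layer has to absorb between average and peak load.
peak_to_avg_ratio = PEAK_MSGS_PER_SEC / AVG_MSGS_PER_SEC

# Batch volume expressed as an average rate if spread evenly over a day.
batch_avg_records_per_sec = BATCH_RECORDS_PER_DAY / SECONDS_PER_DAY

# Raw storage retained at steady state under the 7-year retention rule.
raw_retained_tb = RAW_TB_PER_YEAR * RETENTION_YEARS

if __name__ == "__main__":
    print(f"peak ingress: {peak_mb_s_low:.0f}-{peak_mb_s_high:.0f} MB/s")
    print(f"peak/avg ratio: {peak_to_avg_ratio:.1f}x")
    print(f"batch average: {batch_avg_records_per_sec:.1f} records/s")
    print(f"raw retained after {RETENTION_YEARS} years: {raw_retained_tb} TB")
```

In practice nightly files arrive in bursts, so the batch average understates the rate needed to meet the 30-minute standardization target.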
Requirements
- Ingest HL7 v2 ADT/ORU messages, FHIR Patient/Encounter resources, and nightly flat files into Akido Pipelines.
- Validate transport security, source authentication, schema conformance, and required PHI fields before records enter curated storage.
- Standardize all inputs into a canonical patient record model with source lineage, version history, and event timestamps.
- Support deduplication and idempotent reprocessing for replayed feeds, duplicate files, and late-arriving updates.
- Route malformed or non-conformant records to quarantine without blocking valid traffic.
- Expose curated records to Akido's downstream operational and analytics surfaces with full auditability.
- Design monitoring, alerting, backfill strategy, and disaster recovery.
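The validation, quarantine, and idempotency requirements above can be illustrated with a small in-memory sketch. It is not a prescribed design: the `Router` class, the `REQUIRED_FIELDS` tuple, and the choice of a SHA-256 hash over source ID plus raw payload as the idempotency key are all assumptions made for illustration. A real pipeline would back the `seen` set with a durable store.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical minimum field set; the real required fields are not specified.
REQUIRED_FIELDS = ("source_id", "patient_id", "event_time")


@dataclass
class Router:
    """Routes parsed records to curated storage or quarantine (in-memory sketch)."""
    seen: set = field(default_factory=set)
    curated: list = field(default_factory=list)
    quarantine: list = field(default_factory=list)

    @staticmethod
    def idempotency_key(source_id: str, raw_payload: bytes) -> str:
        # Same source + same bytes yields the same key, so replays are detectable.
        h = hashlib.sha256()
        h.update(source_id.encode("utf-8"))
        h.update(b"\x00")  # separator so the two parts cannot run together
        h.update(raw_payload)
        return h.hexdigest()

    def route(self, record: dict, raw_payload: bytes) -> str:
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            # Quarantine keeps the raw bytes and a reason; valid traffic continues.
            self.quarantine.append(
                {"raw": raw_payload, "reason": f"missing fields: {missing}"}
            )
            return "quarantined"
        key = self.idempotency_key(record["source_id"], raw_payload)
        if key in self.seen:
            return "duplicate"  # replayed feed or duplicate file: no new write
        self.seen.add(key)
        self.curated.append(record)
        return "accepted"


router = Router()
good = {"source_id": "hosp-a", "patient_id": "p1", "event_time": "2024-01-01T00:00:00Z"}
bad = {"source_id": "hosp-a", "event_time": "2024-01-01T00:00:00Z"}
first = router.route(good, b"payload-1")
replay = router.route(good, b"payload-1")
rejected = router.route(bad, b"payload-2")
```

Hashing the raw bytes only catches byte-identical replays. A source that resends the same event with different formatting would need a key built from message-level identifiers instead.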
Constraints
- Must be HIPAA-compliant with encryption in transit and at rest, least-privilege IAM, and immutable audit logs.
- Avoid overprovisioned always-on compute to stay within budget; prefer autoscaling where possible.
- Some hospital partners can only deliver over SFTP once per day; others require near-real-time ingestion.
- Canonicalization must preserve source fidelity; raw payloads cannot be discarded.
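One way to reconcile canonicalization with source fidelity is to make the canonical record an append-only version history, where each version points back to its stored raw payload. The sketch below assumes this approach. The class names, the `raw_uri` pointer, and the rule that the "current" view is the version with the latest event time are all hypothetical, not part of the stated requirements.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Lineage:
    source_system: str
    raw_uri: str     # hypothetical pointer to the untouched raw payload
    raw_sha256: str  # checksum to show the raw payload was not altered


@dataclass(frozen=True)
class PatientVersion:
    version: int
    event_time: datetime
    attributes: dict
    lineage: Lineage


@dataclass
class CanonicalPatient:
    canonical_id: str
    history: list = field(default_factory=list)

    def apply(self, event_time, attributes, source_system, raw_uri, raw_payload):
        """Append a new version; earlier versions are never modified."""
        lineage = Lineage(
            source_system, raw_uri, hashlib.sha256(raw_payload).hexdigest()
        )
        version = PatientVersion(
            len(self.history) + 1, event_time, dict(attributes), lineage
        )
        self.history.append(version)
        return version

    def current(self):
        """Latest view by event time, so late-arriving updates do not win."""
        if not self.history:
            return None
        return max(self.history, key=lambda v: (v.event_time, v.version))


patient = CanonicalPatient("canon-001")
patient.apply(
    datetime(2024, 3, 2, tzinfo=timezone.utc), {"city": "Austin"},
    "hosp-a-adt", "raw/hosp-a/msg-2", b"raw-message-2",
)
# A late-arriving update with an older event time is kept but does not win.
patient.apply(
    datetime(2024, 3, 1, tzinfo=timezone.utc), {"city": "Dallas"},
    "hosp-a-batch", "raw/hosp-a/file-1", b"raw-record-1",
)
latest = patient.current()
```

Because every version keeps its lineage, reprocessing after a mapping change can rebuild the history from the raw payloads without losing the audit trail.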