Context
MediLedger, a healthcare payments platform, currently ingests claims, payment, and member eligibility data through nightly SFTP drops and ad hoc API pulls into an AWS data lake. The existing batch-only flow cannot meet new requirements for near-real-time fraud detection, downstream partner routing, and stricter handling of PHI/PII under HIPAA and SOC 2.
You need to design a secure pipeline that ingests sensitive data from multiple sources, validates and transforms it, and routes it to the correct downstream systems with strong auditability and low operational overhead.
Scale Requirements
- Sources: 120 hospital/insurer feeds via SFTP, REST APIs, and Kafka
- Throughput: 150K records/sec peak, 25K records/sec average
- Payload size: 4-12 KB JSON/CSV/Avro per record
- Daily volume: ~3.5 TB raw, ~1.1B records/day
- Latency target: P95 < 2 minutes from ingestion to downstream availability
- Retention: 7-year encrypted archive, 30-day hot storage
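A quick capacity check on these numbers is worth doing before picking an ingest bus. The sketch below assumes Amazon Kinesis Data Streams (an assumption; the brief only mandates AWS and prefers managed services) and uses its published per-shard ingest limits of 1,000 records/sec and 1 MB/sec to size a stream for the stated peak:

```python
import math

# Published per-shard ingest limits for Amazon Kinesis Data Streams.
SHARD_RECORDS_PER_SEC = 1_000
SHARD_BYTES_PER_SEC = 1_000_000  # 1 MB/s

def shards_needed(peak_records_per_sec: int, max_record_bytes: int) -> int:
    """Size a stream for peak load; whichever limit binds first wins."""
    by_count = math.ceil(peak_records_per_sec / SHARD_RECORDS_PER_SEC)
    by_bytes = math.ceil(peak_records_per_sec * max_record_bytes / SHARD_BYTES_PER_SEC)
    return max(by_count, by_bytes)

# 150K records/sec peak at the worst-case 12 KB record size from the brief.
print(shards_needed(150_000, 12 * 1024))  # → 1844
```

Note that at 4-12 KB per record the byte limit, not the record-count limit, dominates by an order of magnitude, which argues for either record aggregation before the stream or a Kafka-compatible service such as Amazon MSK for the high-volume feeds.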
Requirements
- Ingest batch and streaming data from SFTP, REST, and Kafka with schema enforcement.
- Encrypt data in transit and at rest, with field-level protection for PHI/PII.
- Validate, deduplicate, standardize, and enrich records before routing.
- Route data to multiple sinks: Snowflake for analytics, S3 for archive, and partner APIs/internal services for operational use.
- Support idempotent reprocessing, backfills, and replay without duplicate downstream delivery.
- Provide lineage, audit logs, and access controls for all sensitive datasets.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and delivery stages.
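The field-level PHI/PII protection requirement can be sketched as follows. In production this would use AWS KMS envelope encryption (a data key per batch, decryptable only by authorized roles); the stdlib stand-in below instead shows deterministic keyed tokenization, which has the useful property that the same identifier always maps to the same token, so analytics joins in Snowflake still work without exposing the raw value. The field names are illustrative, not from the brief:

```python
import hmac
import hashlib

# Illustrative PHI field list; the real list would come from data classification.
PHI_FIELDS = {"member_id", "ssn", "dob"}

def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: same input + key -> same token, enabling
    joins on tokenized columns. In production the key would be a
    KMS-managed data key, rotated and access-controlled."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def protect(record: dict, key: bytes) -> dict:
    """Replace PHI fields with tokens before the record leaves the trust boundary."""
    return {
        field: tokenize(str(val), key) if field in PHI_FIELDS else val
        for field, val in record.items()
    }
```

Deterministic tokenization trades some security for joinability; fields that never need to be joined on would be better served by randomized encryption.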
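The validate-and-deduplicate step can be sketched as below, assuming a content-hash idempotency key so that replays and backfills map to the same key. The schema fields are hypothetical, and the in-memory seen-key set stands in for a durable store (e.g. DynamoDB with a 30-day TTL matching the hot-storage window):

```python
import hashlib
import json

# Illustrative required fields; a real deployment would enforce registered schemas.
REQUIRED_FIELDS = {"claim_id", "member_id", "amount_cents", "service_date"}

def idempotency_key(record: dict) -> str:
    """Stable hash over canonicalized content: reprocessing the same record
    always yields the same key, which is what makes replay safe downstream."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def validate(record: dict) -> list[str]:
    """Return a list of schema violations; empty means the record passes."""
    errors = [f"missing:{f}" for f in REQUIRED_FIELDS - record.keys()]
    if "amount_cents" in record and not isinstance(record["amount_cents"], int):
        errors.append("type:amount_cents")
    return errors

class Deduplicator:
    """In-memory stand-in for a durable seen-key store."""
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, record: dict) -> bool:
        key = idempotency_key(record)
        if key in self._seen:
            return False  # duplicate: drop before routing
        self._seen.add(key)
        return True
```

Rejected records would be routed to a quarantine sink with the violation list attached, rather than silently dropped, to preserve the audit trail.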
Constraints
- AWS is the mandated cloud; prefer managed services where possible.
- Compliance: HIPAA, SOC 2, encryption with AWS KMS, least-privilege IAM, and full audit trails.
- Team size: 5 data engineers, 1 platform engineer.
- Budget target: incremental platform cost under $40K/month.
- Some downstream partners only support rate-limited REST delivery and require exactly-once business semantics.
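The last constraint — rate-limited partners requiring exactly-once business semantics — is usually met with at-least-once delivery plus idempotency keys, throttled client-side. The sketch below combines a token bucket with a sent-key ledger; the `Idempotency-Key` header, the `send` callable, and the in-memory ledger are all assumptions standing in for a real partner API and a durable store:

```python
import time

class TokenBucket:
    """Client-side throttle matched to a partner's documented rate limit."""
    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, refilling at the configured rate."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def deliver(record: dict, send, bucket: TokenBucket, sent_keys: set) -> bool:
    """Exactly-once *business* semantics: the transport is at-least-once, but a
    stable key (stamped upstream from the record's content hash) lets both
    sides discard replays. `sent_keys` stands in for a durable sent-ledger."""
    key = record["idempotency_key"]
    if key in sent_keys:
        return False  # already delivered; retry or replay, so skip
    bucket.acquire()
    send(record, headers={"Idempotency-Key": key})  # hypothetical partner call
    sent_keys.add(key)
    return True
```

The key point is that the idempotency key must be derived from record content, not generated per attempt, or retries after a timeout would silently double-deliver.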