Context
MediLedger, a healthcare payments platform, currently ingests claims, payment, and member eligibility data through nightly SFTP drops and ad hoc API pulls into an AWS data lake. The existing batch-only flow cannot meet new requirements for near-real-time fraud detection, downstream partner routing, and stricter handling of PHI/PII under HIPAA and SOC 2.
You need to design a secure pipeline that ingests sensitive data from multiple sources, validates and transforms it, and routes it to the correct downstream systems with strong auditability and low operational overhead.
Scale Requirements
- Sources: 120 hospital/insurer feeds via SFTP, REST APIs, and Kafka
- Throughput: 150K records/sec peak, 25K records/sec average
- Payload size: 4-12 KB JSON/CSV/Avro per record
- Daily volume: ~3.5 TB raw, ~1.1B records/day
- Latency target: P95 < 2 minutes from ingestion to downstream availability
- Retention: 7-year encrypted archive, 30-day hot storage
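A quick capacity check on these numbers is worth doing before picking an ingest bus. The sketch below assumes Amazon Kinesis Data Streams (an assumption; the brief only mandates AWS and prefers managed services) and uses its published per-shard ingest limits of 1,000 records/sec and 1 MB/sec to size a stream for the stated peak:

```python
import math

# Published per-shard ingest limits for Amazon Kinesis Data Streams.
SHARD_RECORDS_PER_SEC = 1_000
SHARD_BYTES_PER_SEC = 1_000_000  # 1 MB/s

def shards_needed(peak_records_per_sec: int, max_record_bytes: int) -> int:
    """Size a stream for peak load; whichever limit binds first wins."""
    by_count = math.ceil(peak_records_per_sec / SHARD_RECORDS_PER_SEC)
    by_bytes = math.ceil(peak_records_per_sec * max_record_bytes / SHARD_BYTES_PER_SEC)
    return max(by_count, by_bytes)

# 150K records/sec peak at the worst-case 12 KB record size from the brief.
print(shards_needed(150_000, 12 * 1024))  # → 1844
```

Note that at 4-12 KB per record the byte limit, not the record-count limit, dominates by an order of magnitude, which argues for either record aggregation before the stream or a Kafka-compatible service such as Amazon MSK for the high-volume feeds.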
Requirements
- Ingest batch and streaming data from SFTP, REST, and Kafka with schema enforcement.
- Encrypt data in transit and at rest, with field-level protection for PHI/PII.
- Validate, deduplicate, standardize, and enrich records before routing.
- Route data to multiple sinks: Snowflake for analytics, S3 for archive, and partner APIs/internal services for operational use.
- Support idempotent reprocessing, backfills, and replay without duplicate downstream delivery.
- Provide lineage, audit logs, and access controls for all sensitive datasets.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and delivery stages.
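The field-level PHI/PII protection requirement can be sketched as follows. In production this would use AWS KMS envelope encryption (a data key per batch, decryptable only by authorized roles); the stdlib stand-in below instead shows deterministic keyed tokenization, which has the useful property that the same identifier always maps to the same token, so analytics joins in Snowflake still work without exposing the raw value. The field names are illustrative, not from the brief:

```python
import hmac
import hashlib

# Illustrative PHI field list; the real list would come from data classification.
PHI_FIELDS = {"member_id", "ssn", "dob"}

def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: same input + key -> same token, enabling
    joins on tokenized columns. In production the key would be a
    KMS-managed data key, rotated and access-controlled."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def protect(record: dict, key: bytes) -> dict:
    """Replace PHI fields with tokens before the record leaves the trust boundary."""
    return {
        field: tokenize(str(val), key) if field in PHI_FIELDS else val
        for field, val in record.items()
    }
```

Deterministic tokenization trades some security for joinability; fields that never need to be joined on would be better served by randomized encryption.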
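The validate-and-deduplicate step can be sketched as below, assuming a content-hash idempotency key so that replays and backfills map to the same key. The schema fields are hypothetical, and the in-memory seen-key set stands in for a durable store (e.g. DynamoDB with a 30-day TTL matching the hot-storage window):

```python
import hashlib
import json

# Illustrative required fields; a real deployment would enforce registered schemas.
REQUIRED_FIELDS = {"claim_id", "member_id", "amount_cents", "service_date"}

def idempotency_key(record: dict) -> str:
    """Stable hash over canonicalized content: reprocessing the same record
    always yields the same key, which is what makes replay safe downstream."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def validate(record: dict) -> list[str]:
    """Return a list of schema violations; empty means the record passes."""
    errors = [f"missing:{f}" for f in REQUIRED_FIELDS - record.keys()]
    if "amount_cents" in record and not isinstance(record["amount_cents"], int):
        errors.append("type:amount_cents")
    return errors

class Deduplicator:
    """In-memory stand-in for a durable seen-key store."""
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, record: dict) -> bool:
        key = idempotency_key(record)
        if key in self._seen:
            return False  # duplicate: drop before routing
        self._seen.add(key)
        return True
```

Rejected records would be routed to a quarantine sink with the violation list attached, rather than silently dropped, to preserve the audit trail.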
Constraints
- AWS is the mandated cloud; prefer managed services where possible.
- Compliance: HIPAA, SOC 2, encryption with AWS KMS, least-privilege IAM, and full audit trails.
- Team size: 5 data engineers, 1 platform engineer.
- Budget target: incremental platform cost under $40K/month.
- Some downstream partners only support rate-limited REST delivery and require exactly-once business semantics.
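The last constraint — rate-limited partners requiring exactly-once business semantics — is usually met with at-least-once delivery plus idempotency keys, throttled client-side. The sketch below combines a token bucket with a sent-key ledger; the `Idempotency-Key` header, the `send` callable, and the in-memory ledger are all assumptions standing in for a real partner API and a durable store:

```python
import time

class TokenBucket:
    """Client-side throttle matched to a partner's documented rate limit."""
    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, refilling at the configured rate."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def deliver(record: dict, send, bucket: TokenBucket, sent_keys: set) -> bool:
    """Exactly-once *business* semantics: the transport is at-least-once, but a
    stable key (stamped upstream from the record's content hash) lets both
    sides discard replays. `sent_keys` stands in for a durable sent-ledger."""
    key = record["idempotency_key"]
    if key in sent_keys:
        return False  # already delivered; retry or replay, so skip
    bucket.acquire()
    send(record, headers={"Idempotency-Key": key})  # hypothetical partner call
    sent_keys.add(key)
    return True
```

The key point is that the idempotency key must be derived from record content, not generated per attempt, or retries after a timeout would silently double-deliver.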