Context
PulsePay, a global payments platform, currently processes transaction and fraud telemetry with hourly batch ETL into Snowflake. This architecture is too slow for fraud detection, operational dashboards, and partner-facing settlement visibility, so the data team needs a real-time pipeline with strong failure recovery and observability.
Data arrives from payment APIs, mobile apps, and internal services as JSON events. The new system must support low-latency enrichment and delivery to analytical and operational consumers without sacrificing correctness.
Scale Requirements
- Ingress throughput: 250K events/sec peak, 80K events/sec average
- Event size: 1-4 KB JSON
- Daily volume: ~12 TB/day of compressed raw data (sanity-checked in the sketch below)
- Latency target: P95 end-to-end < 60 seconds, P99 < 3 minutes
- Retention: Kafka 7 days, raw lake 180 days, curated warehouse 3 years
- Availability: 99.95% for ingestion and processing
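These figures hang together under a simple back-of-envelope check. The sketch below assumes a 2.5 KB average event, taken as the midpoint of the stated 1-4 KB range; that assumption, not anything in the brief, drives the derived numbers:

```python
# Back-of-envelope sizing check. The 2.5 KB average event size is an
# assumption (midpoint of the stated 1-4 KB range), not a given.
AVG_EVENTS_PER_SEC = 80_000
PEAK_EVENTS_PER_SEC = 250_000
AVG_EVENT_BYTES = 2_500

avg_bytes_per_sec = AVG_EVENTS_PER_SEC * AVG_EVENT_BYTES    # ~200 MB/s sustained
peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES  # ~625 MB/s at peak
daily_uncompressed_tb = avg_bytes_per_sec * 86_400 / 1e12   # ~17.3 TB/day

print(f"sustained ingress: {avg_bytes_per_sec / 1e6:.0f} MB/s")
print(f"peak ingress:      {peak_bytes_per_sec / 1e6:.0f} MB/s")
print(f"daily raw volume:  {daily_uncompressed_tb:.1f} TB uncompressed")
# ~17 TB/day uncompressed against the stated ~12 TB/day compressed implies
# only ~1.4x compression at a 2.5 KB average; the actual ratio depends on
# payload redundancy and the true average event size.
```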
Requirements
- Design a streaming pipeline that ingests events from APIs and service logs with horizontal scalability.
- Validate schemas, deduplicate by event_id, and quarantine malformed or poison messages (see the dedup-and-quarantine sketch after this list).
- Enrich events with merchant, user, and geo reference data before writing curated outputs.
- Deliver data to both a low-cost raw data lake and Snowflake for analytics.
- Support replay, backfill, and idempotent reprocessing from durable storage.
- Define monitoring for lag, latency, data quality, throughput, and cost (a lag/latency sketch follows this list).
- Explain failure handling for broker outages, processor crashes, schema changes, and downstream warehouse failures.
- Describe orchestration for deployments, schema migrations, and periodic reconciliation jobs.
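To make the validation, dedup, and quarantine requirement concrete, here is a minimal, self-contained Python sketch. The required-field set, the in-memory seen map, and the quarantine list are hypothetical stand-ins; a production design would use a schema registry, the stream processor's keyed state (e.g. Flink), and a dead-letter Kafka topic instead:

```python
import json
import time

# Hypothetical required fields; the real schema would live in a schema registry.
REQUIRED_FIELDS = {"event_id", "event_type", "merchant_id", "amount"}
DEDUP_TTL_SECONDS = 7 * 24 * 3600  # align the dedup window with Kafka's 7-day retention

seen: dict[str, float] = {}   # event_id -> first-seen time; stand-in for keyed state
quarantine: list[bytes] = []  # stand-in for a dead-letter topic

def handle(raw: bytes) -> dict | None:
    """Validate, deduplicate by event_id, and quarantine a single message.

    Returns the parsed event if it should continue downstream, else None.
    """
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("payload is not a JSON object")
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        quarantine.append(raw)  # park poison/malformed messages instead of crashing
        return None

    now = time.time()
    eid = event["event_id"]
    if eid in seen and now - seen[eid] < DEDUP_TTL_SECONDS:
        return None  # duplicate within the window: drop
    seen[eid] = now
    return event

# Illustrative run: the second copy of e1 is dropped, the non-JSON payload quarantined.
for raw in (
    b'{"event_id": "e1", "event_type": "auth", "merchant_id": "m9", "amount": 12.5}',
    b'{"event_id": "e1", "event_type": "auth", "merchant_id": "m9", "amount": 12.5}',
    b"not json",
):
    print("pass" if handle(raw) else "drop")
print(f"quarantined: {len(quarantine)}")
```

Sizing the dedup window to the 7-day Kafka retention means any replay from the broker stays idempotent over the same horizon.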
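For the monitoring requirement, the two most load-bearing signals are consumer lag and end-to-end latency against the stated P95/P99 targets. The offsets and the latency sample below are illustrative; in practice offsets come from the Kafka admin API and latency from an event-time field stamped at the source:

```python
import statistics

# Consumer lag: how far committed offsets trail the log end, per partition.
log_end_offsets = {0: 1_200_000, 1: 1_180_000}    # illustrative values
committed_offsets = {0: 1_199_500, 1: 1_050_000}  # illustrative values

lag = {p: log_end_offsets[p] - committed_offsets[p] for p in log_end_offsets}
print(f"total lag: {sum(lag.values())} events; worst partition: {max(lag, key=lag.get)}")

# End-to-end latency: wall clock at delivery minus the event's own timestamp.
def slo_met(latencies_sec: list[float]) -> bool:
    """Alert check against the stated targets: P95 < 60 s, P99 < 180 s."""
    pct = statistics.quantiles(latencies_sec, n=100)  # pct[94] ~ P95, pct[98] ~ P99
    return pct[94] < 60 and pct[98] < 180

sample = [5.0] * 960 + [90.0] * 35 + [200.0] * 5  # synthetic latency sample
print("SLO met" if slo_met(sample) else "SLO breached")
```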
Constraints
- AWS is mandatory; existing footprint includes S3, EKS, and Snowflake.
- Incremental platform budget is capped at $40K/month.
- PCI-sensitive fields must be tokenized before landing in analytics stores (see the tokenization sketch after this list).
- Team size is small: 5 data engineers, 1 platform engineer.
- The design should prefer managed services where operational burden is materially reduced.
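On the PCI constraint: one common approach is deterministic keyed tokenization, so tokens stay joinable across tables while the raw value never reaches an analytics store. The field names and the hardcoded key below are placeholders; the real key would come from AWS KMS or Secrets Manager:

```python
import hashlib
import hmac

PCI_FIELDS = {"card_number", "cvv"}          # hypothetical sensitive fields
TOKEN_KEY = b"fetch-from-kms-in-production"  # placeholder; never hardcode real keys

def tokenize(event: dict) -> dict:
    """Replace PCI-sensitive values with deterministic HMAC-SHA256 tokens.

    Deterministic tokens keep joins and aggregations working in the warehouse
    while the raw value never lands in an analytics store.
    """
    out = dict(event)
    for field in PCI_FIELDS & event.keys():
        digest = hmac.new(TOKEN_KEY, str(event[field]).encode(), hashlib.sha256)
        out[field] = f"tok_{digest.hexdigest()[:32]}"
    return out

print(tokenize({"event_id": "e1", "card_number": "4111111111111111", "amount": 12.5}))
```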