Context
PulsePay, a global payments platform, currently processes transaction and fraud telemetry with hourly batch ETL into Snowflake. This architecture is too slow for fraud detection, operational dashboards, and partner-facing settlement visibility, so the data team needs a real-time pipeline with strong failure recovery and observability.
Data arrives from payment APIs, mobile apps, and internal services as JSON events. The new system must support low-latency enrichment and delivery to analytical and operational consumers without sacrificing correctness.
Scale Requirements
- Ingress throughput: 250K events/sec peak, 80K events/sec average
- Event size: 1-4 KB JSON
- Daily volume: ~12 TB/day of compressed raw data (sanity-checked in the sketch below)
- Latency target: P95 end-to-end < 60 seconds, P99 < 3 minutes
- Retention: Kafka 7 days, raw lake 180 days, curated warehouse 3 years
- Availability: 99.95% for ingestion and processing
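These figures hang together under a simple back-of-envelope check. The sketch below assumes a 2.5 KB average event, taken as the midpoint of the stated 1-4 KB range; that assumption, not anything in the brief, drives the derived numbers:

```python
# Back-of-envelope sizing check. The 2.5 KB average event size is an
# assumption (midpoint of the stated 1-4 KB range), not a given.
AVG_EVENTS_PER_SEC = 80_000
PEAK_EVENTS_PER_SEC = 250_000
AVG_EVENT_BYTES = 2_500

avg_bytes_per_sec = AVG_EVENTS_PER_SEC * AVG_EVENT_BYTES    # ~200 MB/s sustained
peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES  # ~625 MB/s at peak
daily_uncompressed_tb = avg_bytes_per_sec * 86_400 / 1e12   # ~17.3 TB/day

print(f"sustained ingress: {avg_bytes_per_sec / 1e6:.0f} MB/s")
print(f"peak ingress:      {peak_bytes_per_sec / 1e6:.0f} MB/s")
print(f"daily raw volume:  {daily_uncompressed_tb:.1f} TB uncompressed")
# ~17 TB/day uncompressed against the stated ~12 TB/day compressed implies
# only ~1.4x compression at a 2.5 KB average; the actual ratio depends on
# payload redundancy and the true average event size.
```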
Requirements
- Design a streaming pipeline that ingests events from APIs and service logs with horizontal scalability.
- Validate schemas, deduplicate by event_id, and quarantine malformed or poison messages (see the dedup-and-quarantine sketch after this list).
- Enrich events with merchant, user, and geo reference data before writing curated outputs.
- Deliver data to both a low-cost raw data lake and Snowflake for analytics.
- Support replay, backfill, and idempotent reprocessing from durable storage.
- Define monitoring for lag, latency, data quality, throughput, and cost (a lag/latency sketch follows this list).
- Explain failure handling for broker outages, processor crashes, schema changes, and downstream warehouse failures.
- Describe orchestration for deployments, schema migrations, and periodic reconciliation jobs.
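To make the validation, dedup, and quarantine requirement concrete, here is a minimal, self-contained Python sketch. The required-field set, the in-memory seen map, and the quarantine list are hypothetical stand-ins; a production design would use a schema registry, the stream processor's keyed state (e.g. Flink), and a dead-letter Kafka topic instead:

```python
import json
import time

# Hypothetical required fields; the real schema would live in a schema registry.
REQUIRED_FIELDS = {"event_id", "event_type", "merchant_id", "amount"}
DEDUP_TTL_SECONDS = 7 * 24 * 3600  # align the dedup window with Kafka's 7-day retention

seen: dict[str, float] = {}   # event_id -> first-seen time; stand-in for keyed state
quarantine: list[bytes] = []  # stand-in for a dead-letter topic

def handle(raw: bytes) -> dict | None:
    """Validate, deduplicate by event_id, and quarantine a single message.

    Returns the parsed event if it should continue downstream, else None.
    """
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("payload is not a JSON object")
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        quarantine.append(raw)  # park poison/malformed messages instead of crashing
        return None

    now = time.time()
    eid = event["event_id"]
    if eid in seen and now - seen[eid] < DEDUP_TTL_SECONDS:
        return None  # duplicate within the window: drop
    seen[eid] = now
    return event

# Illustrative run: the second copy of e1 is dropped, the non-JSON payload quarantined.
for raw in (
    b'{"event_id": "e1", "event_type": "auth", "merchant_id": "m9", "amount": 12.5}',
    b'{"event_id": "e1", "event_type": "auth", "merchant_id": "m9", "amount": 12.5}',
    b"not json",
):
    print("pass" if handle(raw) else "drop")
print(f"quarantined: {len(quarantine)}")
```

Sizing the dedup window to the 7-day Kafka retention means any replay from the broker stays idempotent over the same horizon.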
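For the monitoring requirement, the two most load-bearing signals are consumer lag and end-to-end latency against the stated P95/P99 targets. The offsets and the latency sample below are illustrative; in practice offsets come from the Kafka admin API and latency from an event-time field stamped at the source:

```python
import statistics

# Consumer lag: how far committed offsets trail the log end, per partition.
log_end_offsets = {0: 1_200_000, 1: 1_180_000}    # illustrative values
committed_offsets = {0: 1_199_500, 1: 1_050_000}  # illustrative values

lag = {p: log_end_offsets[p] - committed_offsets[p] for p in log_end_offsets}
print(f"total lag: {sum(lag.values())} events; worst partition: {max(lag, key=lag.get)}")

# End-to-end latency: wall clock at delivery minus the event's own timestamp.
def slo_met(latencies_sec: list[float]) -> bool:
    """Alert check against the stated targets: P95 < 60 s, P99 < 180 s."""
    pct = statistics.quantiles(latencies_sec, n=100)  # pct[94] ~ P95, pct[98] ~ P99
    return pct[94] < 60 and pct[98] < 180

sample = [5.0] * 960 + [90.0] * 35 + [200.0] * 5  # synthetic latency sample
print("SLO met" if slo_met(sample) else "SLO breached")
```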
Constraints
- AWS is mandatory; existing footprint includes S3, EKS, and Snowflake.
- Incremental platform budget is capped at $40K/month.
- PCI-sensitive fields must be tokenized before landing in analytics stores (see the tokenization sketch after this list).
- Team size is small: 5 data engineers, 1 platform engineer.
- The design should prefer managed services where operational burden is materially reduced.
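On the PCI constraint: one common approach is deterministic keyed tokenization, so tokens stay joinable across tables while the raw value never reaches an analytics store. The field names and the hardcoded key below are placeholders; the real key would come from AWS KMS or Secrets Manager:

```python
import hashlib
import hmac

PCI_FIELDS = {"card_number", "cvv"}          # hypothetical sensitive fields
TOKEN_KEY = b"fetch-from-kms-in-production"  # placeholder; never hardcode real keys

def tokenize(event: dict) -> dict:
    """Replace PCI-sensitive values with deterministic HMAC-SHA256 tokens.

    Deterministic tokens keep joins and aggregations working in the warehouse
    while the raw value never lands in an analytics store.
    """
    out = dict(event)
    for field in PCI_FIELDS & event.keys():
        digest = hmac.new(TOKEN_KEY, str(event[field]).encode(), hashlib.sha256)
        out[field] = f"tok_{digest.hexdigest()[:32]}"
    return out

print(tokenize({"event_id": "e1", "card_number": "4111111111111111", "amount": 12.5}))
```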