Context
ShopWave, a retail marketplace, currently ingests order, inventory, and clickstream events through REST APIs into S3 and runs hourly Spark batch jobs for downstream analytics. The platform now needs sub-minute freshness for fraud detection and inventory updates, and the data team must decide whether to standardize on Apache Kafka, Apache Flink, or both as the core of the new streaming architecture.
Your task is to design the target pipeline and justify which role, if any, Kafka and Flink each play in it, rather than treating them as interchangeable tools.
Scale Requirements
- Sources: web/mobile clickstream, order service CDC, inventory service events, payment webhooks
- Peak throughput: 180K events/sec, average 45K events/sec (see the sizing sketch after this list)
- Event size: 1-4 KB JSON/Avro
- Latency target: fraud signals < 10 seconds, analytics aggregates < 2 minutes
- Retention: raw events for 30 days, curated warehouse tables for 2 years
- Availability target: 99.9% for ingestion and processing
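For orientation, here is a back-of-envelope sizing from the numbers above, sketched in Python. The ~10 MB/s-per-partition write rate and the worst-case 4 KB event size are planning assumptions, not measurements:

```python
# Rough sizing from the stated workload; all figures are upper bounds.
PEAK_EVENTS_PER_SEC = 180_000
AVG_EVENTS_PER_SEC = 45_000
MAX_EVENT_BYTES = 4 * 1024      # events are 1-4 KB; use the worst case
PARTITION_MBPS = 10             # assumed sustainable write rate per Kafka partition

peak_mb_per_sec = PEAK_EVENTS_PER_SEC * MAX_EVENT_BYTES / 1_000_000
raw_30d_tb = AVG_EVENTS_PER_SEC * MAX_EVENT_BYTES * 86_400 * 30 / 1e12

print(f"peak ingest:       ~{peak_mb_per_sec:.0f} MB/s")   # ~737 MB/s at peak
print(f"partitions needed: ~{peak_mb_per_sec / PARTITION_MBPS:.0f} (before replication)")
print(f"30-day raw volume: ~{raw_30d_tb:.0f} TB (uncompressed, single copy)")
```

Replication (typically 3x) and compression will shift these figures, but they bound the broker fleet and the 30-day retention cost regardless of how the Kafka/Flink split lands.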
Functional Requirements
- Design an end-to-end streaming pipeline from producers to serving layers.
- Explain which responsibilities belong to Kafka vs Flink: ingestion, buffering, replay, stateful processing, windowing, joins, and delivery.
- Support exactly-once or effectively-once processing for order and payment events (see the transactional-producer sketch after this list).
- Implement schema validation, deduplication, and dead-letter handling for malformed events (validation/DLQ sketch below).
- Load curated outputs into Snowflake for analytics and Redis for low-latency fraud lookups (serving-layer sketch below).
- Describe how the system handles backfills, late-arriving events, and schema evolution (replay sketch below).
- Define monitoring, alerting, and recovery procedures (lag-probe sketch below).
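The sketches below are illustrative options, not the expected solution; broker addresses, topic names, and sizes are placeholders. For the exactly-once requirement on order and payment events, one building block is an idempotent, transactional Kafka producer; consumers of these topics then read with isolation.level=read_committed, and a Flink job run with exactly-once checkpointing can preserve the guarantee through processing:

```python
import json
from confluent_kafka import Producer, KafkaException

# Idempotent, transactional producer for order/payment events.
# Broker retries cannot introduce duplicates, and a batch of sends
# either commits atomically or is aborted and retried.
producer = Producer({
    "bootstrap.servers": "BROKERS",         # placeholder
    "enable.idempotence": True,
    "acks": "all",
    "transactional.id": "order-ingest-01",  # stable per producer instance
})
producer.init_transactions()

def send_batch(events: list[dict]) -> None:
    producer.begin_transaction()
    try:
        for e in events:
            producer.produce("orders.raw", key=e["order_id"], value=json.dumps(e))
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()
        raise
```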
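One possible shape for the schema-validation, deduplication, and dead-letter stage, shown here as a plain Kafka consumer loop. The JSON schema is trimmed, topic names are placeholders, and the Redis-backed dedup window is just one option; in a Flink job the same logic would live in a keyed-state operator:

```python
import json
from confluent_kafka import Consumer, Producer
from jsonschema import ValidationError, validate
import redis

ORDER_SCHEMA = {  # trimmed, illustrative schema
    "type": "object",
    "required": ["event_id", "order_id", "amount"],
    "properties": {"event_id": {"type": "string"},
                   "order_id": {"type": "string"},
                   "amount": {"type": "number"}},
}

consumer = Consumer({"bootstrap.servers": "BROKERS", "group.id": "order-curator",
                     "enable.auto.commit": False, "isolation.level": "read_committed"})
producer = Producer({"bootstrap.servers": "BROKERS"})
dedup = redis.Redis(host="REDIS_HOST", port=6379)

consumer.subscribe(["orders.raw"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        validate(event, ORDER_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as exc:
        # Malformed or schema-violating event: route to the dead-letter topic with the reason.
        producer.produce("orders.dlq", value=msg.value(),
                         headers=[("error", str(exc)[:200].encode())])
        consumer.commit(message=msg)
        continue
    # Dedup on event_id: SET NX with a TTL that covers the retry/replay window.
    if dedup.set(f"seen:{event['event_id']}", 1, nx=True, ex=7 * 24 * 3600):
        producer.produce("orders.curated", key=event["order_id"], value=json.dumps(event))
    producer.poll(0)
    consumer.commit(message=msg)
```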
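A minimal sketch of the two serving-layer write paths. Bucket names, key layout, and TTLs are placeholders, and the Snowflake side assumes files landed in S3 are picked up by Snowpipe or a scheduled COPY INTO, which is configured in Snowflake rather than shown here:

```python
import gzip
import json
import time
import uuid

import boto3
import redis

s3 = boto3.client("s3")
fraud_cache = redis.Redis(host="REDIS_HOST", port=6379)

def publish_fraud_features(card_id: str, features: dict) -> None:
    """Low-latency path: fraud features keyed by card, with a short TTL."""
    key = f"fraud:{card_id}"
    fraud_cache.hset(key, mapping={k: json.dumps(v) for k, v in features.items()})
    fraud_cache.expire(key, 3600)

def land_curated_batch(records: list[dict]) -> None:
    """Analytics path: land newline-delimited JSON in S3 for Snowflake ingestion."""
    body = gzip.compress("\n".join(json.dumps(r) for r in records).encode())
    key = f"curated/orders/dt={time.strftime('%Y-%m-%d')}/{uuid.uuid4()}.json.gz"
    s3.put_object(Bucket="shopwave-curated", Key=key, Body=body)
```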
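Backfills can replay from the Kafka log itself, which the 30-day raw retention allows; the sketch below seeks a dedicated consumer group to the offsets at a chosen timestamp (group and broker names are placeholders). Late-arriving events are then a processing-side concern, typically event-time windows with a bounded allowed lateness and a side output for anything later, and schema evolution is managed at the registry level with backward-compatible Avro changes:

```python
from confluent_kafka import Consumer, TopicPartition

def replay_from(topic: str, num_partitions: int, start_ms: int) -> Consumer:
    """Backfill consumer: seek every partition to the first offset at or after start_ms.
    Runs under its own group.id so the live pipeline's consumers are untouched."""
    consumer = Consumer({"bootstrap.servers": "BROKERS", "group.id": "orders-backfill"})
    # offsets_for_times() takes the target timestamp in the .offset field of each TopicPartition.
    wanted = [TopicPartition(topic, p, start_ms) for p in range(num_partitions)]
    consumer.assign(consumer.offsets_for_times(wanted, timeout=10))
    return consumer  # poll() now yields events from start_ms onward
```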
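For monitoring, the first-line signal is consumer-group lag against the latency budget; a lag probe along these lines (group, topic, and partition count are placeholders) could feed CloudWatch alarms:

```python
from confluent_kafka import Consumer, TopicPartition

def consumer_lag(consumer: Consumer, topic: str, partitions: list[int]) -> dict[int, int]:
    """Lag of the group's committed offsets behind the log end, per partition."""
    committed = consumer.committed([TopicPartition(topic, p) for p in partitions], timeout=10)
    lag = {}
    for tp in committed:
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        # No committed offset yet: count the whole retained log as lag.
        lag[tp.partition] = high - (tp.offset if tp.offset >= 0 else low)
    return lag

# Example: probe the fraud-scoring group's lag on the curated orders topic.
probe = Consumer({"bootstrap.servers": "BROKERS", "group.id": "fraud-scorer"})
print(consumer_lag(probe, "orders.curated", partitions=list(range(72))))
```

Recovery then reuses the replay path: restart from the last committed offsets, or reseek by timestamp as in the backfill sketch above.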
Constraints
- AWS-first environment; managed services preferred where practical
- Team has prior Kafka experience but limited Flink expertise
- Incremental budget cap: $35K/month
- PCI-related payment events require encryption in transit and at rest
- Existing batch S3 lake must remain available as fallback during migration