Context
FlashCart, a grocery delivery platform, currently runs a mixed batch and streaming data platform on AWS. Order events, inventory updates, courier pings, and app telemetry flow into Kafka and are processed by Spark jobs before landing in S3 and Snowflake. During promotions, weather disruptions, or regional outages, event volume can spike 5-8x, causing consumer lag, delayed dashboards, and occasional duplicate loads.
You need to design a resilient pipeline architecture that keeps critical datasets stable during sudden demand spikes while preserving data quality and recovery guarantees.
Scale Requirements
- Normal throughput: 120K events/sec average across all topics
- Spike throughput: 750K events/sec sustained for up to 45 minutes
- Event size: 1-3 KB JSON/Avro
- Daily volume: ~9 TB raw, ~3 TB compressed Parquet
- Latency targets:
  - Operational metrics: < 2 minutes end-to-end
  - Finance/order facts: < 10 minutes end-to-end
- Retention: Kafka 7 days, S3 raw 180 days, curated warehouse tables 3 years
- Availability target: 99.9% for ingestion and critical transformations
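As a rough sizing aid, the figures above can be turned into a spike budget. The sketch below uses only the stated numbers; the function name `spike_budget`, the decimal unit conversions, and the 30-day window for the availability calculation are assumptions made for illustration. It applies the stated 1-3 KB event size directly, so the byte figures are best read as an uncompressed range.

```python
# Figures taken from the Scale Requirements above.
NORMAL_EPS = 120_000          # events/sec, normal average
SPIKE_EPS = 750_000           # events/sec, sustained spike
SPIKE_MINUTES = 45            # maximum sustained spike duration
EVENT_KB_MIN, EVENT_KB_MAX = 1, 3
AVAILABILITY = 0.999

# Assumption (not in the source): decimal units.
KB_PER_MB = 1_000
KB_PER_TB = 10**9


def spike_budget() -> dict:
    """Derive rough capacity numbers for one worst-case spike."""
    spike_events = SPIKE_EPS * SPIKE_MINUTES * 60
    return {
        # How far above normal the spike runs (6.25x, inside the 5-8x range).
        "spike_ratio": SPIKE_EPS / NORMAL_EPS,
        # Total events in one 45-minute spike (2,025,000,000).
        "spike_events": spike_events,
        # Ingress bandwidth during the spike (750 to 2,250 MB/s).
        "ingress_mb_per_sec": (
            SPIKE_EPS * EVENT_KB_MIN / KB_PER_MB,
            SPIKE_EPS * EVENT_KB_MAX / KB_PER_MB,
        ),
        # Uncompressed volume of one spike (about 2.0 to 6.1 TB).
        "spike_volume_tb": (
            spike_events * EVENT_KB_MIN / KB_PER_TB,
            spike_events * EVENT_KB_MAX / KB_PER_TB,
        ),
        # Downtime allowed by 99.9% over an assumed 30-day month (about 43.2 minutes).
        "downtime_min_per_30d": (1 - AVAILABILITY) * 30 * 24 * 60,
    }


if __name__ == "__main__":
    for name, value in spike_budget().items():
        print(f"{name}: {value}")
```

One consequence worth noting: a single full-length spike is roughly as long as the entire monthly downtime allowance under these assumptions, so the availability target leaves little room for ingestion to fail during a spike.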
Requirements
- Design ingestion and processing layers that absorb 750K events/sec, sustained for up to 45 minutes, without dropping messages.
- Prioritize critical pipelines (orders, payments, inventory) over non-critical telemetry during backpressure.
- Ensure idempotent processing and safe replay for duplicate, late, or retried events.
- Maintain data quality checks for schema validation, null-rate anomalies, and duplicate detection during spikes.
- Define orchestration, autoscaling, and degradation strategies when downstream systems slow down.
- Support backfills and replay from raw storage without corrupting warehouse tables.
- Provide monitoring, alerting, and failure recovery for Kafka, Spark, Airflow, S3, and Snowflake.
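One possible way to meet the idempotency and safe-replay requirements is a version-guarded upsert keyed on the business key. The sketch below is illustrative only: the `OrderEvent` fields, the per-order `version` counter, and the in-memory dict standing in for a curated table are all assumptions, not part of the existing platform. In a warehouse, the same rule would typically be expressed as a MERGE keyed on the order ID with a version comparison.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class OrderEvent:
    order_id: str   # business key of the target row
    version: int    # hypothetical per-order counter that increases with each change
    status: str


def apply_events(table: Dict[str, OrderEvent], events: Iterable[OrderEvent]) -> int:
    """Upsert events into `table`, returning how many rows were changed.

    An event is applied only if its version is strictly newer than the stored
    row, so duplicates, retries, late arrivals, and full replays are no-ops.
    """
    applied = 0
    for event in events:
        current = table.get(event.order_id)
        if current is not None and current.version >= event.version:
            continue  # duplicate or late event: stored state is already as new
        table[event.order_id] = event
        applied += 1
    return applied


if __name__ == "__main__":
    batch = [
        OrderEvent("o-1", 1, "placed"),
        OrderEvent("o-1", 2, "delivered"),
        OrderEvent("o-1", 2, "delivered"),  # duplicate delivery
        OrderEvent("o-1", 1, "placed"),     # late arrival
    ]
    table: Dict[str, OrderEvent] = {}
    print(apply_events(table, batch))  # first load changes rows
    print(apply_events(table, batch))  # replaying the same batch changes nothing
    print(table["o-1"].status)
```

Because the outcome depends only on the highest version seen per key, the result is the same regardless of arrival order or how many times a batch is replayed from raw storage, which is the property backfills need.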
Constraints
- AWS is the required cloud; existing services include MSK, EMR, S3, Airflow, and Snowflake.
- The incremental budget is capped at $35K/month.
- Team size is 5 data engineers and 1 platform engineer.
- PCI-related payment data must be encrypted in transit and at rest, with restricted access to curated tables.