Context
ShopPulse, a retail analytics company on AWS, runs nightly Spark jobs on Amazon EMR that process order, inventory, and clickstream data into S3 and Redshift. Product and operations teams now need sub-minute visibility into checkout failures and inventory changes, so the company must decide where to keep Spark on EMR for batch ETL and where to introduce Kinesis Data Streams and Firehose for streaming ingestion.
You are asked to design a target-state architecture and explain the trade-offs between Spark on EMR and Kinesis/Firehose, including where each fits in the pipeline.
Scale Requirements
- Batch data: 12 TB/day from transactional databases and application logs
- Streaming data: 180K events/sec peak, 40K events/sec average, ~1.5 KB per JSON event
- Latency targets: batch SLA < 2 hours; streaming dashboards < 60 seconds
- Retention: raw data in S3 for 180 days; curated warehouse data for 2 years
- Availability: 99.9% for the ingestion path
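A quick capacity check against the numbers above, assuming Kinesis Data Streams' standard per-shard write limits of 1 MiB/s or 1,000 records/s (whichever binds first):

```python
import math

# Streaming load from the scale requirements
PEAK_EVENTS_PER_SEC = 180_000
AVG_EVENTS_PER_SEC = 40_000
BYTES_PER_EVENT = 1536  # ~1.5 KB JSON

# Kinesis Data Streams per-shard write limits (provisioned mode)
SHARD_BYTES_PER_SEC = 1_048_576  # 1 MiB/s
SHARD_RECORDS_PER_SEC = 1_000

def shards_needed(events_per_sec: int, event_bytes: int) -> int:
    """Shards required to absorb an ingest rate; the tighter limit wins."""
    by_bytes = math.ceil(events_per_sec * event_bytes / SHARD_BYTES_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bytes, by_records)

print(shards_needed(PEAK_EVENTS_PER_SEC, BYTES_PER_EVENT))  # 264 shards at peak
print(shards_needed(AVG_EVENTS_PER_SEC, BYTES_PER_EVENT))   # 59 shards on average
```

Because each event is larger than 1 KiB, the byte limit (not the record limit) dominates: roughly 264 shards at peak versus 59 on average. That 4.5x swing is one argument for on-demand capacity mode or shard autoscaling rather than static provisioning.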
Requirements
- Design a pipeline that supports both nightly batch ETL and near-real-time event delivery.
- Clearly identify which workloads should use Spark on EMR versus Kinesis Data Streams / Firehose and why.
- Include transformations such as schema validation, deduplication, partitioning, and aggregation.
- Land raw and curated data in AWS storage layers that support replay and downstream analytics.
- Describe orchestration, monitoring, and recovery for failed jobs, late data, and malformed records.
- Show how analysts can query the output in Redshift or Athena with minimal operational overhead.
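One way to satisfy the schema-validation and malformed-record requirements without running custom consumers is a Firehose data-transformation Lambda. A minimal sketch follows; the field names (`event_id`, `event_type`, `timestamp`) are illustrative assumptions, not from the scenario, and deduplication/aggregation are assumed to live in the Spark batch layer rather than here:

```python
import base64
import json

# Hypothetical minimum schema for a checkout/inventory event
REQUIRED_FIELDS = {"event_id", "event_type", "timestamp"}

def handler(event, context):
    """Firehose transformation Lambda: validate each record's JSON payload.

    Valid records pass through unchanged ("Ok"); malformed ones are marked
    "ProcessingFailed" so Firehose routes them to the configured S3 error
    prefix for later inspection and replay.
    """
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not REQUIRED_FIELDS.issubset(payload):
                raise ValueError("missing required fields")
            result = "Ok"
        except ValueError:  # covers JSONDecodeError and bad base64 too
            result = "ProcessingFailed"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],  # payload passed through unmodified
        })
    return {"records": output}
```

This keeps streaming operations thin for a Spark-centric team: Firehose owns batching, retries, and delivery to S3, and the only custom code is a stateless validator.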
Constraints
- AWS-native stack preferred; no Kafka unless strongly justified
- Team has strong Spark skills but limited streaming operations experience
- Incremental budget cap: $30K/month
- PII in customer events must be encrypted at rest and in transit
- Solution should minimize custom consumer management where possible
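The PII constraint maps naturally to KMS encryption on the Firehose S3 destination (Kinesis and Firehose encrypt in transit with TLS by default). A sketch of the relevant boto3 `create_delivery_stream` arguments, assuming a raw-data bucket, IAM roles, and a customer-managed KMS key already exist; all names and ARNs are placeholders:

```python
# Firehose delivery stream config: Kinesis source, partitioned S3 destination,
# KMS encryption at rest. In practice this dict would be passed to
# boto3.client("firehose").create_delivery_stream(**config).
config = {
    "DeliveryStreamName": "shoppulse-events",       # hypothetical name
    "DeliveryStreamType": "KinesisStreamAsSource",  # read from the data stream
    "KinesisStreamSourceConfiguration": {
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/shoppulse-events",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read",
    },
    "ExtendedS3DestinationConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write",
        "BucketARN": "arn:aws:s3:::shoppulse-raw",
        # Hive-style date/hour prefixes so Athena and Glue can prune partitions
        "Prefix": "events/dt=!{timestamp:yyyy-MM-dd}/hr=!{timestamp:HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        # Small buffer to stay near the sub-minute dashboard target
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},
        "CompressionFormat": "GZIP",
        # Encrypt PII at rest with a customer-managed KMS key
        "EncryptionConfiguration": {
            "KMSEncryptionConfig": {
                "AWSKMSKeyARN": "arn:aws:kms:us-east-1:123456789012:key/REPLACE_ME"
            }
        },
    },
}
```

Keeping delivery, buffering, and encryption inside Firehose configuration (rather than custom consumer code) is also the cheapest way to honor the "minimize custom consumer management" constraint.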