Context
ShopPulse, a retail analytics company on AWS, runs nightly Spark jobs on Amazon EMR that process order, inventory, and clickstream data into S3 and Redshift. Product and operations teams now need sub-minute visibility into checkout failures and inventory changes, so the company must decide where to keep Spark on EMR for batch ETL and where to introduce Kinesis Data Streams and Firehose for streaming ingestion.
You are asked to design a target-state architecture and explain the trade-offs between Spark on EMR and Kinesis/Firehose, including where each fits in the pipeline.
Scale Requirements
- Batch data: 12 TB/day from transactional databases and application logs
- Streaming data: 180K events/sec peak, 40K events/sec average, ~1.5 KB per JSON event
- Latency targets: batch SLA < 2 hours; streaming dashboards < 60 seconds
- Retention: raw data in S3 for 180 days; curated warehouse data for 2 years
- Availability: 99.9% for the ingestion path
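A quick capacity check against the numbers above, assuming Kinesis Data Streams' standard per-shard write limits of 1 MiB/s or 1,000 records/s (whichever binds first):

```python
import math

# Streaming load from the scale requirements
PEAK_EVENTS_PER_SEC = 180_000
AVG_EVENTS_PER_SEC = 40_000
BYTES_PER_EVENT = 1536  # ~1.5 KB JSON

# Kinesis Data Streams per-shard write limits (provisioned mode)
SHARD_BYTES_PER_SEC = 1_048_576  # 1 MiB/s
SHARD_RECORDS_PER_SEC = 1_000

def shards_needed(events_per_sec: int, event_bytes: int) -> int:
    """Shards required to absorb an ingest rate; the tighter limit wins."""
    by_bytes = math.ceil(events_per_sec * event_bytes / SHARD_BYTES_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bytes, by_records)

print(shards_needed(PEAK_EVENTS_PER_SEC, BYTES_PER_EVENT))  # 264 shards at peak
print(shards_needed(AVG_EVENTS_PER_SEC, BYTES_PER_EVENT))   # 59 shards on average
```

Because each event is larger than 1 KiB, the byte limit (not the record limit) dominates: roughly 264 shards at peak versus 59 on average. That 4.5x swing is one argument for on-demand capacity mode or shard autoscaling rather than static provisioning.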
Requirements
- Design a pipeline that supports both nightly batch ETL and near-real-time event delivery.
- Clearly identify which workloads should use Spark on EMR versus Kinesis Data Streams / Firehose and why.
- Include transformations such as schema validation, deduplication, partitioning, and aggregation.
- Land raw and curated data in AWS storage layers that support replay and downstream analytics.
- Describe orchestration, monitoring, and recovery for failed jobs, late data, and malformed records.
- Show how analysts can query the output in Redshift or Athena with minimal operational overhead.
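One way to satisfy the schema-validation and malformed-record requirements without running custom consumers is a Firehose data-transformation Lambda. A minimal sketch follows; the field names (`event_id`, `event_type`, `timestamp`) are illustrative assumptions, not from the scenario, and deduplication/aggregation are assumed to live in the Spark batch layer rather than here:

```python
import base64
import json

# Hypothetical minimum schema for a checkout/inventory event
REQUIRED_FIELDS = {"event_id", "event_type", "timestamp"}

def handler(event, context):
    """Firehose transformation Lambda: validate each record's JSON payload.

    Valid records pass through unchanged ("Ok"); malformed ones are marked
    "ProcessingFailed" so Firehose routes them to the configured S3 error
    prefix for later inspection and replay.
    """
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not REQUIRED_FIELDS.issubset(payload):
                raise ValueError("missing required fields")
            result = "Ok"
        except ValueError:  # covers JSONDecodeError and bad base64 too
            result = "ProcessingFailed"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],  # payload passed through unmodified
        })
    return {"records": output}
```

This keeps streaming operations thin for a Spark-centric team: Firehose owns batching, retries, and delivery to S3, and the only custom code is a stateless validator.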
Constraints
- AWS-native stack preferred; no Kafka unless strongly justified
- Team has strong Spark skills but limited streaming operations experience
- Incremental budget cap: $30K/month
- PII in customer events must be encrypted at rest and in transit
- Solution should minimize custom consumer management where possible
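The PII constraint maps naturally to KMS encryption on the Firehose S3 destination (Kinesis and Firehose encrypt in transit with TLS by default). A sketch of the relevant boto3 `create_delivery_stream` arguments, assuming a raw-data bucket, IAM roles, and a customer-managed KMS key already exist; all names and ARNs are placeholders:

```python
# Firehose delivery stream config: Kinesis source, partitioned S3 destination,
# KMS encryption at rest. In practice this dict would be passed to
# boto3.client("firehose").create_delivery_stream(**config).
config = {
    "DeliveryStreamName": "shoppulse-events",       # hypothetical name
    "DeliveryStreamType": "KinesisStreamAsSource",  # read from the data stream
    "KinesisStreamSourceConfiguration": {
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/shoppulse-events",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read",
    },
    "ExtendedS3DestinationConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write",
        "BucketARN": "arn:aws:s3:::shoppulse-raw",
        # Hive-style date/hour prefixes so Athena and Glue can prune partitions
        "Prefix": "events/dt=!{timestamp:yyyy-MM-dd}/hr=!{timestamp:HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        # Small buffer to stay near the sub-minute dashboard target
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},
        "CompressionFormat": "GZIP",
        # Encrypt PII at rest with a customer-managed KMS key
        "EncryptionConfiguration": {
            "KMSEncryptionConfig": {
                "AWSKMSKeyARN": "arn:aws:kms:us-east-1:123456789012:key/REPLACE_ME"
            }
        },
    },
}
```

Keeping delivery, buffering, and encryption inside Firehose configuration (rather than custom consumer code) is also the cheapest way to honor the "minimize custom consumer management" constraint.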