Context
ShopWave, a mid-size retail marketplace, currently runs nightly Hadoop MapReduce jobs on HDFS to process order, inventory, and clickstream data for reporting. The business now needs near-real-time operational dashboards and faster analytics, so the data team wants to redesign the pipeline around Spark while preserving batch reliability and adding streaming support.
You are asked to design a production-ready data platform that can process both historical and incremental data using a modern big data stack.
Scale Requirements
- Batch volume: 6 TB/day across orders, inventory, and customer events
- Streaming volume: 80K events/sec peak from web and mobile applications
- Latency target: batch outputs available by 6:00 AM; streaming data queryable within 3 minutes
- Retention: raw data for 180 days, curated warehouse tables for 3 years
- Data size: average event payload 1.5 KB JSON
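As a rough sizing check (using the figures above and assuming uncompressed JSON): 80,000 events/s × 1.5 KB ≈ 120 MB/s at peak, or on the order of 10 TB/day if that peak were sustained, so raw landing storage, compaction, and the 180-day retention window should be budgeted against numbers in this range.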
Requirements
- Design a pipeline that ingests data from OLTP databases, application event streams, and third-party CSV drops.
- Use Spark for distributed transformation, including joins, deduplication, schema enforcement, and aggregations (a minimal batch sketch follows this list).
- Support both batch ETL and stream processing paths into a central analytics store (see the streaming sketch below).
- Orchestrate dependencies, retries, and backfills for daily and hourly jobs (see the orchestration sketch below).
- Implement data quality checks for null keys, duplicate records, schema drift, and late-arriving data (see the quality-check sketch below).
- Provide a strategy for partitioning, storage format, and incremental processing.
- Describe monitoring, alerting, and failure recovery for production operations.
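A minimal PySpark sketch of the batch transformation path, covering schema enforcement, deduplication, and a date-partitioned Parquet write. Bucket names, paths, the orders schema, and the order_id/order_ts column names are assumptions for illustration, not part of the brief.

```python
# Hypothetical batch job: enforce a schema at read time, deduplicate on order_id,
# and write a date-partitioned curated table. Paths and columns are assumptions.
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("orders_batch_etl").getOrCreate()

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Enforce the schema at read time; fields missing from the source become nulls.
raw = (spark.read
       .schema(orders_schema)
       .json("s3://shopwave-raw/orders/ingest_date=2024-01-01/"))  # hypothetical path

# Deduplicate: keep the latest record per order_id based on order_ts.
w = Window.partitionBy("order_id").orderBy(F.col("order_ts").desc())
deduped = (raw
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Write the curated output partitioned by event date for pruned downstream reads.
(deduped
 .withColumn("order_date", F.to_date("order_ts"))
 .write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://shopwave-curated/orders/"))  # hypothetical bucket
```

Joins and aggregations would follow the same pattern on the deduplicated DataFrame; keeping the curated layer partitioned by event date supports both partition pruning and incremental reprocessing of individual days.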
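A sketch of the streaming path using Spark Structured Streaming, assuming events arrive on Kafka (for example Amazon MSK; the brief does not fix the transport). The topic name, event schema, and S3 paths are likewise assumptions, and the Kafka source requires the spark-sql-kafka connector on the classpath.

```python
# Hypothetical streaming job: read clickstream events from Kafka, parse JSON with
# an explicit schema, and append to the analytics store on a one-minute trigger.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream_stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "msk-broker:9092")  # hypothetical brokers
          .option("subscribe", "clickstream")                    # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", F.to_date("event_ts")))

# Append to the curated store; a one-minute micro-batch trigger leaves headroom
# against the 3-minute queryability target.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://shopwave-curated/clickstream/")        # hypothetical path
         .option("checkpointLocation", "s3://shopwave-chk/clickstream/")
         .partitionBy("event_date")
         .trigger(processingTime="1 minute")
         .start())
```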
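A sketch of how daily orchestration with retries and backfills could look, assuming an Airflow-style scheduler such as Amazon MWAA; the brief does not name an orchestrator, and the task bodies below are placeholders.

```python
# Hypothetical Airflow DAG showing daily scheduling, per-task retries, and
# backfills via catchup. Airflow itself and the task commands are assumptions.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="shopwave_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",   # assumed slot, early enough to land before 6:00 AM
    catchup=True,                    # enables historical backfills of missed days
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    ingest = BashOperator(task_id="ingest_raw", bash_command="echo ingest")          # placeholder
    transform = BashOperator(task_id="spark_transform", bash_command="echo spark")   # placeholder
    quality = BashOperator(task_id="quality_checks", bash_command="echo dq")         # placeholder

    ingest >> transform >> quality
```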
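A hand-rolled quality-gate sketch for null keys and duplicates; the column name and thresholds are assumptions, and in practice a library such as Deequ or Great Expectations could replace it. Schema-drift and late-arrival checks would hang off the same pattern.

```python
# Hypothetical quality gate: fail the run if null keys or duplicate order_ids
# exceed a threshold. Thresholds and column names are assumptions.
from pyspark.sql import DataFrame, functions as F

def run_quality_checks(df: DataFrame, key_col: str = "order_id",
                       max_null_ratio: float = 0.0,
                       max_dup_ratio: float = 0.001) -> None:
    total = df.count()
    if total == 0:
        raise ValueError("Quality check failed: input DataFrame is empty")

    null_keys = df.filter(F.col(key_col).isNull()).count()
    distinct_keys = df.select(key_col).distinct().count()
    dup_rows = total - distinct_keys

    if null_keys / total > max_null_ratio:
        raise ValueError(f"Quality check failed: {null_keys} null {key_col} values")
    if dup_rows / total > max_dup_ratio:
        raise ValueError(f"Quality check failed: {dup_rows} duplicate {key_col} rows")

# Example: gate the curated write on the checks passing.
# run_quality_checks(deduped, key_col="order_id")
```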
Constraints
- Existing environment is AWS-based with limited appetite for managing large Hadoop clusters.
- Team size is 3 data engineers and 1 platform engineer.
- Budget increase is capped at $20K/month.
- PII fields must be encrypted at rest and deleted within 7 days of a valid privacy request (a deletion sketch follows this list).
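One way to meet the 7-day deletion deadline in a Spark-based lake is a scheduled deletion job. The sketch below assumes the curated tables are stored as Delta Lake (a table format the brief does not mandate) and that a customer_id column identifies the data subject.

```python
# Hypothetical privacy-deletion job against a Delta Lake table. Requires the
# delta-spark package; the table path and column name are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("privacy_delete")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

def delete_customer(customer_id: str, table_path: str) -> None:
    # Remove all rows for the requesting customer from the curated table.
    table = DeltaTable.forPath(spark, table_path)
    table.delete(f"customer_id = '{customer_id}'")

# Example usage for a single request against the curated orders table.
# delete_customer("cust-123", "s3://shopwave-curated/orders_delta/")
```

Running a job like this daily against the queue of pending requests, together with field-level encryption on PII columns, is one hedged way to satisfy both halves of this constraint.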