Context
ShopWave, a mid-size retail marketplace, currently runs nightly Hadoop MapReduce jobs on HDFS to process order, inventory, and clickstream data for reporting. The business now needs near-real-time operational dashboards and faster analytics, so the data team wants to redesign the pipeline around Spark while preserving batch reliability and adding streaming support.
You are asked to design a production-ready data platform that can process both historical and incremental data using a modern big data stack.
Scale Requirements
- Batch volume: 6 TB/day across orders, inventory, and customer events
- Streaming volume: 80K events/sec peak from web and mobile applications
- Latency target: batch outputs available by 6:00 AM; streaming data queryable within 3 minutes
- Retention: raw data for 180 days, curated warehouse tables for 3 years
- Data size: average event payload 1.5 KB JSON
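As a rough sizing check (using the figures above and assuming uncompressed JSON): 80,000 events/s × 1.5 KB ≈ 120 MB/s at peak, or on the order of 10 TB/day if that peak were sustained, so raw landing storage, compaction, and the 180-day retention window should be budgeted against numbers in this range.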
Requirements
- Design a pipeline that ingests data from OLTP databases, application event streams, and third-party CSV drops.
- Use Spark for distributed transformation, including joins, deduplication, schema enforcement, and aggregations (a minimal batch sketch follows this list).
- Support both batch ETL and stream processing paths into a central analytics store (see the streaming sketch below).
- Orchestrate dependencies, retries, and backfills for daily and hourly jobs (see the orchestration sketch below).
- Implement data quality checks for null keys, duplicate records, schema drift, and late-arriving data (see the quality-check sketch below).
- Provide a strategy for partitioning, storage format, and incremental processing.
- Describe monitoring, alerting, and failure recovery for production operations.
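A minimal PySpark sketch of the batch transformation path, covering schema enforcement, deduplication, and a date-partitioned Parquet write. Bucket names, paths, the orders schema, and the order_id/order_ts column names are assumptions for illustration, not part of the brief.

```python
# Hypothetical batch job: enforce a schema at read time, deduplicate on order_id,
# and write a date-partitioned curated table. Paths and columns are assumptions.
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("orders_batch_etl").getOrCreate()

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Enforce the schema at read time; fields missing from the source become nulls.
raw = (spark.read
       .schema(orders_schema)
       .json("s3://shopwave-raw/orders/ingest_date=2024-01-01/"))  # hypothetical path

# Deduplicate: keep the latest record per order_id based on order_ts.
w = Window.partitionBy("order_id").orderBy(F.col("order_ts").desc())
deduped = (raw
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Write the curated output partitioned by event date for pruned downstream reads.
(deduped
 .withColumn("order_date", F.to_date("order_ts"))
 .write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://shopwave-curated/orders/"))  # hypothetical bucket
```

Joins and aggregations would follow the same pattern on the deduplicated DataFrame; keeping the curated layer partitioned by event date supports both partition pruning and incremental reprocessing of individual days.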
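A sketch of the streaming path using Spark Structured Streaming, assuming events arrive on Kafka (for example Amazon MSK; the brief does not fix the transport). The topic name, event schema, and S3 paths are likewise assumptions, and the Kafka source requires the spark-sql-kafka connector on the classpath.

```python
# Hypothetical streaming job: read clickstream events from Kafka, parse JSON with
# an explicit schema, and append to the analytics store on a one-minute trigger.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream_stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "msk-broker:9092")  # hypothetical brokers
          .option("subscribe", "clickstream")                    # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", F.to_date("event_ts")))

# Append to the curated store; a one-minute micro-batch trigger leaves headroom
# against the 3-minute queryability target.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://shopwave-curated/clickstream/")        # hypothetical path
         .option("checkpointLocation", "s3://shopwave-chk/clickstream/")
         .partitionBy("event_date")
         .trigger(processingTime="1 minute")
         .start())
```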
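A sketch of how daily orchestration with retries and backfills could look, assuming an Airflow-style scheduler such as Amazon MWAA; the brief does not name an orchestrator, and the task bodies below are placeholders.

```python
# Hypothetical Airflow DAG showing daily scheduling, per-task retries, and
# backfills via catchup. Airflow itself and the task commands are assumptions.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="shopwave_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",   # assumed slot, early enough to land before 6:00 AM
    catchup=True,                    # enables historical backfills of missed days
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    ingest = BashOperator(task_id="ingest_raw", bash_command="echo ingest")          # placeholder
    transform = BashOperator(task_id="spark_transform", bash_command="echo spark")   # placeholder
    quality = BashOperator(task_id="quality_checks", bash_command="echo dq")         # placeholder

    ingest >> transform >> quality
```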
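A hand-rolled quality-gate sketch for null keys and duplicates; the column name and thresholds are assumptions, and in practice a library such as Deequ or Great Expectations could replace it. Schema-drift and late-arrival checks would hang off the same pattern.

```python
# Hypothetical quality gate: fail the run if null keys or duplicate order_ids
# exceed a threshold. Thresholds and column names are assumptions.
from pyspark.sql import DataFrame, functions as F

def run_quality_checks(df: DataFrame, key_col: str = "order_id",
                       max_null_ratio: float = 0.0,
                       max_dup_ratio: float = 0.001) -> None:
    total = df.count()
    if total == 0:
        raise ValueError("Quality check failed: input DataFrame is empty")

    null_keys = df.filter(F.col(key_col).isNull()).count()
    distinct_keys = df.select(key_col).distinct().count()
    dup_rows = total - distinct_keys

    if null_keys / total > max_null_ratio:
        raise ValueError(f"Quality check failed: {null_keys} null {key_col} values")
    if dup_rows / total > max_dup_ratio:
        raise ValueError(f"Quality check failed: {dup_rows} duplicate {key_col} rows")

# Example: gate the curated write on the checks passing.
# run_quality_checks(deduped, key_col="order_id")
```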
Constraints
- Existing environment is AWS-based with limited appetite for managing large Hadoop clusters.
- Team size is 3 data engineers and 1 platform engineer.
- Budget increase is capped at $20K/month.
- PII fields must be encrypted at rest and deleted within 7 days of a valid privacy request (a deletion sketch follows this list).
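One way to meet the 7-day deletion deadline in a Spark-based lake is a scheduled deletion job. The sketch below assumes the curated tables are stored as Delta Lake (a table format the brief does not mandate) and that a customer_id column identifies the data subject.

```python
# Hypothetical privacy-deletion job against a Delta Lake table. Requires the
# delta-spark package; the table path and column name are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("privacy_delete")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

def delete_customer(customer_id: str, table_path: str) -> None:
    # Remove all rows for the requesting customer from the curated table.
    table = DeltaTable.forPath(spark, table_path)
    table.delete(f"customer_id = '{customer_id}'")

# Example usage for a single request against the curated orders table.
# delete_customer("cust-123", "s3://shopwave-curated/orders_delta/")
```

Running a job like this daily against the queue of pending requests, together with field-level encryption on PII columns, is one hedged way to satisfy both halves of this constraint.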