Context
PersonalizeCo, a retail company with a diverse customer base, collects user interaction data from channels including web, mobile app, and in-store systems. Today this data is processed in daily batches, so insights arrive up to 24 hours after the events they describe, too late for timely marketing campaigns and customer engagement. To improve responsiveness and personalization, the company wants an ETL pipeline that ingests and processes data in real time.
Scale Requirements
- Throughput: Handle 200K events/sec during peak shopping hours (e.g., Black Friday).
- Event Size: Each event is approximately 1KB in size (user interactions, purchases, etc.).
- Daily Volume: Roughly 17.3TB of raw data per day (200K events/sec × 1KB × 86,400s, i.e., the peak rate sustained for a full day; see the sizing sketch after this list).
- Latency Target: Data should be available for analytics within 2 minutes of event occurrence.
- Retention: Raw data retention for 30 days, with aggregated analytics data kept indefinitely.
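The daily-volume figure is simply the peak rate carried across a full day, which makes it a conservative ceiling. A quick sanity check in Python, using only the constants from the bullets above:

```python
# Back-of-the-envelope sizing from the figures above.
PEAK_EVENTS_PER_SEC = 200_000
EVENT_SIZE_BYTES = 1_000      # "~1KB" per event, decimal
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 30

daily_tb = PEAK_EVENTS_PER_SEC * EVENT_SIZE_BYTES * SECONDS_PER_DAY / 1e12
print(f"Daily raw volume: {daily_tb:.2f} TB")      # 17.28 TB -> the ~17.3TB figure

# Raw events kept for 30 days, before any compression or columnar encoding.
print(f"30-day raw retention: {daily_tb * RETENTION_DAYS:.0f} TB")   # ~518 TB
```

In practice the sustained average will sit well below peak, and columnar formats such as Parquet typically compress this class of event data several-fold, so the 30-day S3 footprint should land well under the raw 518TB ceiling.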
Requirements
- Design an ETL pipeline capable of processing 200K events/sec.
- Implement real-time data quality checks, such as schema validation and duplicate detection (see the streaming sketch after this list).
- Transform raw data into user profiles and interaction histories for analytics.
- Load processed data into a Snowflake data warehouse with under 2 minutes of latency (see the micro-batch load sketch after this list).
- Set up monitoring and alerting for data quality and processing failures (a CloudWatch sketch follows the list as well).
- Ensure backward compatibility with existing batch data consumers.
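A minimal sketch of the quality-check stage in PySpark Structured Streaming, playing to the team's Spark experience. The broker address, topic name, and event fields are illustrative assumptions, not part of the spec; the dedup step follows the documented watermark-plus-dropDuplicates pattern so that state stays bounded:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("quality-checks").getOrCreate()

# Expected event schema; records that fail JSON parsing come back as all-null rows.
event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("event_time", TimestampType(), False),
    StructField("payload", StringType(), True),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
       .option("subscribe", "user-events")                # hypothetical topic
       .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Schema validation: split out malformed records rather than silently dropping them;
# `invalid` would be written to a dead-letter sink (e.g., an S3 prefix) for review.
valid = parsed.filter(col("event_id").isNotNull() & col("user_id").isNotNull())
invalid = parsed.filter(col("event_id").isNull() | col("user_id").isNull())

# Duplicate detection: keep one row per (event_id, event_time) inside a bounded
# lateness window, so dedup state is evicted as the watermark advances.
deduped = (valid
           .withWatermark("event_time", "10 minutes")
           .dropDuplicates(["event_id", "event_time"]))
```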
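Continuing the sketch, one way to meet the sub-2-minute load target is foreachBatch micro-batches through the spark-snowflake connector, triggered well inside the latency budget. All connection values and the table name below are placeholders:

```python
# Hypothetical connection options for the spark-snowflake connector.
sf_options = {
    "sfURL": "personalizeco.snowflakecomputing.com",  # placeholder account URL
    "sfUser": "etl_user",                             # placeholder credentials
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "EVENTS",
    "sfWarehouse": "LOAD_WH",
}

def write_to_snowflake(batch_df, batch_id):
    # Append each micro-batch; the connector stages data internally before COPY.
    (batch_df.write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "USER_INTERACTIONS")       # placeholder target table
        .mode("append")
        .save())

query = (deduped.writeStream
         .foreachBatch(write_to_snowflake)
         .option("checkpointLocation", "s3://personalizeco-etl/checkpoints/")  # placeholder bucket
         .trigger(processingTime="30 seconds")  # 30s triggers leave headroom in the 2-min budget
         .start())
```

Snowflake's Snowpipe Streaming is an alternative if per-second latency ever becomes necessary, but 30-second micro-batches are simpler to operate with a five-person team and comfortably meet the 2-minute target.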
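For the monitoring requirement, one option is to forward the streaming query's built-in progress report to CloudWatch (already available in the AWS account) and alarm when micro-batch durations approach the latency budget. The metric name, namespace, and SNS topic ARN are illustrative, and this reuses the `query` handle from the previous sketch:

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")

# One-time setup: page the on-call channel if three consecutive minutes of
# micro-batches exceed the 2-minute end-to-end budget.
cloudwatch.put_metric_alarm(
    AlarmName="etl-latency-breach",                   # hypothetical alarm name
    Namespace="PersonalizeCo/ETL",                    # hypothetical namespace
    MetricName="MicroBatchDurationSeconds",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=120.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:etl-alerts"],  # hypothetical topic
)

# Poll the query's most recent progress dict and forward it as a custom metric.
while query.isActive:
    progress = query.lastProgress
    if progress:
        cloudwatch.put_metric_data(
            Namespace="PersonalizeCo/ETL",
            MetricData=[{
                "MetricName": "MicroBatchDurationSeconds",
                "Value": progress["durationMs"]["triggerExecution"] / 1000.0,
                "Unit": "Seconds",
            }],
        )
    time.sleep(60)
```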
Constraints
- Team: 5 data engineers with experience in Spark and Snowflake.
- Infrastructure: AWS-based architecture (existing EC2, S3, Snowflake).
- Budget: $30K/month for cloud resources.
- Compliance: Must adhere to GDPR regulations regarding user data handling.