Context
ShopSmart, a leading e-commerce platform, processes an average of 10TB of data per day from sources including transaction logs, user activity streams, and inventory systems. The current ETL process is rudimentary: it does not enforce data quality and cannot easily accommodate new data sources. The goal is to redesign the ETL pipeline to improve data integrity, enable faster analytics, and comply with GDPR.
Scale Requirements
- Data Volume: 10TB daily, scaling to 20TB during peak seasons (e.g., Black Friday); a sizing sketch follows this list.
- Batch Frequency: Hourly ETL jobs to ensure data freshness.
- Latency Target: Data should be available for analysis within 1 hour of extraction.
- Retention: Raw data for 90 days, aggregated data for 2 years.
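A quick back-of-envelope sizing, derived only from the numbers above, helps validate the hourly batch target and the raw-data retention footprint. The 3:1 compression ratio is an assumption (e.g., Parquet on S3), not a stated requirement.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumption: raw data compresses roughly 3:1 when stored as Parquet on S3.

DAILY_TB = 10          # average daily volume
PEAK_TB = 20           # peak-season daily volume
BATCHES_PER_DAY = 24   # hourly ETL jobs
RAW_RETENTION_DAYS = 90
COMPRESSION_RATIO = 3  # assumed compression ratio

avg_batch_tb = DAILY_TB / BATCHES_PER_DAY     # ~0.42 TB per hourly batch
peak_batch_tb = PEAK_TB / BATCHES_PER_DAY     # ~0.83 TB per hourly batch
raw_footprint_tb = DAILY_TB * RAW_RETENTION_DAYS / COMPRESSION_RATIO  # ~300 TB on S3

print(f"Average hourly batch: {avg_batch_tb:.2f} TB")
print(f"Peak hourly batch:    {peak_batch_tb:.2f} TB")
print(f"90-day raw footprint: {raw_footprint_tb:.0f} TB (assuming {COMPRESSION_RATIO}:1 compression)")
```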
Requirements
- Design a modular ETL pipeline capable of integrating data from multiple sources (RDBMS, APIs, flat files).
- Implement data validation checks to ensure data quality (e.g., schema validation, null checks); a minimal sketch follows this list.
- Ensure compliance with GDPR by incorporating data anonymization and user-deletion (right to erasure) processes; a deletion sketch also follows this list.
- Utilize a metadata management system to track data lineage and transformations.
- Set up monitoring and alerting for data quality metrics and ETL job performance.
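To make the validation requirement concrete, here is a minimal sketch of schema validation and null checks in Python. The column names, the `validate_batch` helper, and the use of pandas are illustrative assumptions; real schemas would come from the metadata management system.

```python
import pandas as pd

# Illustrative schema for one source; in practice this would be loaded
# from the metadata management system rather than hard-coded.
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "user_id": "int64",
    "amount": "float64",
    "created_at": "datetime64[ns]",
}
NON_NULLABLE = ["order_id", "user_id", "created_at"]

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations found in one hourly batch."""
    errors = []
    # Schema validation: required columns present with the expected dtypes.
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            errors.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    # Null checks on columns that must always be populated.
    for column in NON_NULLABLE:
        if column in df.columns and df[column].isnull().any():
            errors.append(f"{column}: contains nulls")
    return errors
```

Batches that return a non-empty error list would be quarantined and surfaced through the data-quality alerting described above.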
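For the GDPR requirement, the sketch below shows one common pattern (crypto-erasure): user identifiers are pseudonymized with a per-user salt before data lands in the lake, and a deletion request is served by discarding the salt, so raw S3 objects do not need to be rewritten within the 72-hour window. The in-memory salt store and function names are assumptions; in practice the salts would live in a small RDS or DynamoDB table.

```python
import hashlib
import secrets

# Hypothetical per-user salt store; would be a durable table in practice.
user_salts: dict[int, str] = {}

def pseudonymize(user_id: int) -> str:
    """Replace a raw user ID with a salted hash before it enters the data lake."""
    salt = user_salts.setdefault(user_id, secrets.token_hex(16))
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()

def handle_deletion_request(user_id: int) -> None:
    """Crypto-erasure: discarding the salt makes all existing hashes for this
    user unlinkable, satisfying the 72-hour deletion window without
    rewriting historical raw files."""
    user_salts.pop(user_id, None)
```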
Constraints
- Infrastructure: AWS-based architecture (existing RDS, S3, and Glue).
- Budget: Limited to $15K/month for cloud services.
- Team: 3 data engineers with experience in Python and SQL, but limited knowledge of orchestration tools; a Glue-triggering sketch follows this list.
- Compliance: Must support user data deletion requests within 72 hours.
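Given the team's Python/SQL background and limited orchestration experience, the hourly jobs could initially be driven from plain Python against the existing Glue setup. The sketch below is a minimal example using boto3; the job name, argument, and region are placeholders, and a managed orchestrator (e.g., Step Functions or MWAA) could replace this loop later.

```python
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

def run_hourly_etl(job_name: str = "shopsmart-hourly-etl") -> str:
    """Start an existing Glue job run and block until it finishes."""
    run_id = glue.start_job_run(
        JobName=job_name,
        Arguments={"--batch_hour": time.strftime("%Y-%m-%d-%H")},
    )["JobRunId"]

    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            return state
        time.sleep(30)  # poll every 30 seconds; job-performance alerting hooks in here
```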