Context
RetailCorp, a major retail chain, collects vast amounts of transactional data from over 1,000 stores nationwide. Currently, data is processed in daily batch jobs on on-premises infrastructure, which delays analytics and reporting. The data engineering team aims to transition to a cloud-based ETL pipeline to improve data freshness and scalability.
Scale Requirements
- Data Volume: 10TB of transaction data weekly (roughly 1.4TB per day on average).
- Latency: ETL jobs must complete within 2 hours of data availability.
- Concurrency: Support 50 simultaneous ETL jobs during peak hours.
- Retention: Store raw data for 30 days and aggregated data indefinitely.
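A quick sanity check on the numbers above: 10TB per week landed within a 2-hour window implies a sustained ingest rate of roughly 200 MB/s for the daily batch. A minimal sketch of that arithmetic (decimal TB/GB assumed):

```python
# Back-of-envelope throughput check for the scale requirements above.
WEEKLY_TB = 10            # stated weekly volume
WINDOW_SECONDS = 2 * 3600 # ETL must finish within 2 hours of data availability

daily_bytes = WEEKLY_TB / 7 * 1e12          # ~1.43 TB/day in decimal bytes
required_mb_per_s = daily_bytes / WINDOW_SECONDS / 1e6

print(f"Daily volume: {daily_bytes / 1e12:.2f} TB")
print(f"Required sustained throughput: {required_mb_per_s:.0f} MB/s")
```

This is the floor for a single end-to-end run; with 50 concurrent jobs at peak, per-job throughput can be lower, but the aggregate pipeline still has to clear ~200 MB/s.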
Functional Requirements
- Design an ETL pipeline that extracts data from multiple sources (POS systems, online transactions, and inventory databases).
- Implement data quality checks (schema validation, completeness checks, and anomaly detection).
- Transform and aggregate data for analytics, creating summary tables in Snowflake.
- Schedule and orchestrate ETL jobs using a cloud-native orchestration tool.
- Enable monitoring and alerting for data quality issues and job failures.
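The data quality checks called for above can be prototyped in plain Python before committing to a framework. A minimal sketch, assuming a hypothetical transaction record shape (`txn_id`, `store_id`, `amount`, `ts` are illustrative field names, not from the source), covering schema validation, completeness, and a simple z-score anomaly check:

```python
from statistics import mean, stdev

# Hypothetical schema for a POS transaction record (field name -> expected type).
EXPECTED_SCHEMA = {"txn_id": str, "store_id": int, "amount": float, "ts": str}

def validate_schema(record):
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def completeness_ratio(records, field):
    """Fraction of records with a non-null value for `field`."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

def zscore_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

In production these checks would typically run as a validation task between extract and load, with failures routed to the monitoring and alerting channel rather than silently dropped.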
Constraints
- Infrastructure: Transition to AWS or GCP for cloud services.
- Budget: Monthly cloud expenditure must not exceed $15K.
- Compliance: Must adhere to PCI DSS for handling payment information.
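The retention and budget constraints can be cross-checked with a rough storage estimate. A minimal sketch, assuming an illustrative object-storage price of $0.023 per GB-month (a stand-in figure, not from the source; actual AWS/GCP pricing varies by region and storage class):

```python
# Rough steady-state storage cost for the 30-day raw retention window.
WEEKLY_TB = 10
RETENTION_DAYS = 30
PRICE_PER_GB_MONTH = 0.023  # illustrative assumption, not a quoted price

daily_tb = WEEKLY_TB / 7                      # ~1.43 TB/day
raw_gb = RETENTION_DAYS * daily_tb * 1000     # ~43,000 GB held at steady state
monthly_cost = raw_gb * PRICE_PER_GB_MONTH

print(f"Raw data at steady state: {raw_gb / 1000:.1f} TB")
print(f"Estimated raw-storage cost: ${monthly_cost:,.0f}/month")
```

Under these assumptions, raw storage consumes only a small slice of the $15K monthly ceiling, leaving most of the budget for compute, orchestration, and Snowflake; the indefinitely retained aggregates and PCI-compliant controls (encryption, access logging) would add to this and should be estimated separately.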