Context
RetailCorp, a leading retail chain, processes over 10 million transactions daily across its online and physical stores. Currently, the company uses a nightly batch ETL process to load sales data into its data warehouse, resulting in significant delays for reporting and analytics. To enhance operational efficiency and decision-making, the VP of Data Engineering has mandated the creation of a real-time ETL pipeline.
Scale Requirements
- Throughput: Process 10 million transactions daily (~116 transactions/sec on average; intraday peaks will be several times higher and should be provisioned for).
- Latency target: Data should be available for querying in the data warehouse within 5 minutes of the transaction occurring.
- Storage: Store raw and transformed data in a cloud-based data warehouse, with retention of raw data for 90 days and aggregated data indefinitely.
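The throughput figure above can be sanity-checked with quick arithmetic; the 5x peak factor below is an illustrative assumption for capacity planning, not part of the spec:

```python
# Back-of-the-envelope throughput check for the stated scale.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_txns = 10_000_000
avg_tps = daily_txns / SECONDS_PER_DAY   # ~115.7 transactions/sec average
peak_tps = avg_tps * 5                   # assumed 5x peak-to-average ratio

print(f"average: {avg_tps:.1f} tx/s, planned peak: {peak_tps:.0f} tx/s")
```

A uniform-rate figure understates what the pipeline must absorb during lunchtime and holiday spikes, which is why the peak multiplier matters when sizing stream shards or partitions.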
Requirements
- Design a real-time ingestion pipeline that captures sales transactions from multiple sources (POS systems, e-commerce platforms).
- Implement data quality checks to validate transaction data (e.g., schema validation, duplicate detection).
- Transform data into a structured format suitable for analytics (e.g., aggregating daily sales, calculating metrics).
- Load transformed data into a cloud-based data warehouse (e.g., Amazon Redshift, which the team already operates) with < 5 minute end-to-end latency.
- Create monitoring and alerting mechanisms for data quality and pipeline performance.
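To make the data quality requirement concrete, here is a minimal sketch of per-record schema validation and duplicate detection. The field names (`transaction_id`, `store_id`, `amount`, `timestamp`) are hypothetical, and the in-memory set stands in for what would in practice be a TTL-backed store (e.g., DynamoDB or Redis) in a distributed consumer:

```python
from datetime import datetime

# Assumed transaction schema -- adjust to the real POS/e-commerce payloads.
REQUIRED_FIELDS = {"transaction_id", "store_id", "amount", "timestamp"}

# In-memory stand-in for a shared, TTL-expiring dedup store.
_seen_ids: set[str] = set()

def validate(record: dict) -> list[str]:
    """Return a list of data-quality errors; an empty list means the record passes."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
        return errors  # further checks would raise KeyError
    if not isinstance(record["amount"], (int, float)) or record["amount"] < 0:
        errors.append("amount must be a non-negative number")
    try:
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        errors.append("timestamp is not ISO-8601")
    return errors

def is_duplicate(record: dict) -> bool:
    """Flag records whose transaction_id has already been seen."""
    txn_id = record["transaction_id"]
    if txn_id in _seen_ids:
        return True
    _seen_ids.add(txn_id)
    return False
```

In a streaming deployment these checks would run inside the consumer (e.g., a Lambda or Kinesis Data Analytics application), routing failed records to a dead-letter queue rather than dropping them, so the monitoring requirement below can track rejection rates.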
Constraints
- Team: 5 data engineers with expertise in AWS and SQL but limited experience in streaming technologies.
- Infrastructure: AWS-based (existing S3, Redshift).
- Budget: $20K/month for cloud services.