Context
RetailCorp, a leading e-commerce platform, handles approximately 10TB of sales data daily from multiple sources, including transactional databases, web logs, and third-party APIs. The current ETL process is batch-oriented, running nightly, which leads to data freshness issues and delayed insights for the business analytics team. To address this, the VP of Data Engineering has mandated the design of a new ETL pipeline that ensures data quality and provides near real-time analytics capabilities.
Current Architecture
| Component | Technology | Issue |
|---|---|---|
| Data Sources | MySQL, REST APIs, and S3 | Daily batch load causes data latency |
| ETL Tool | Apache Nifi | Limited data quality checks |
| Storage | Amazon Redshift | Slow query performance due to unoptimized table design |
| Orchestration | Apache Airflow | Complex and difficult to manage |
Scale Requirements
- Throughput: Process 10TB of data daily, an average of roughly 420GB/hour.
- Latency: Ensure data is available for querying within 1 hour of ingestion.
- Retention: Store raw data for 30 days and aggregated data indefinitely (see the lifecycle-policy sketch after this list).
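One way to meet the retention requirement is an S3 lifecycle rule that expires raw objects after 30 days while leaving aggregated data untouched. The sketch below assumes a `retailcorp-raw-data` bucket and a `raw/` prefix, neither of which is specified in the brief.

```python
import boto3

# Assumed bucket and prefix; adjust to the actual raw-data landing zone.
RAW_BUCKET = "retailcorp-raw-data"

s3 = boto3.client("s3")

# Expire objects under raw/ after 30 days; aggregated data lives elsewhere
# (e.g. in Redshift or a separate prefix) and is not touched by this rule.
s3.put_bucket_lifecycle_configuration(
    Bucket=RAW_BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-after-30-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```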
Requirements
- Design an ETL pipeline that ingests data from multiple sources, with data integrity and quality checks applied at each stage.
- Implement transformations to optimize data for analytics, including deduplication, validation, and schema enforcement (a transformation sketch follows this list).
- Load processed data into Amazon Redshift using table structures optimized for query performance, such as distribution and sort keys (a DDL sketch follows below).
- Create monitoring and alerting mechanisms for data quality issues, latency, and system health.
- Orchestrate the pipeline with Apache Airflow, with clear task dependencies and error handling (a DAG skeleton is sketched below).
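To make the deduplication, validation, and schema-enforcement requirement concrete, here is a minimal pandas sketch; the column names and business rules are illustrative assumptions, not part of the brief.

```python
import pandas as pd

# Assumed schema for a sales record; the real schema would come from the
# source systems and be enforced centrally (e.g. via a schema registry).
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64",
                    "amount": "float64", "order_ts": "datetime64[ns]"}

def clean_sales_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, validate, and enforce the expected schema on one batch."""
    # Schema enforcement: fail fast if columns are missing, then cast types.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    df = df.astype({k: v for k, v in EXPECTED_COLUMNS.items()
                    if v != "datetime64[ns]"})
    df["order_ts"] = pd.to_datetime(df["order_ts"])

    # Validation: drop rows that violate basic business rules.
    df = df[df["amount"] >= 0]

    # Deduplication: keep the latest record per order_id.
    df = (df.sort_values("order_ts")
            .drop_duplicates(subset="order_id", keep="last"))
    return df
```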
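For the Redshift load, optimizing table structures typically means choosing distribution and sort keys that match join and filter patterns. A minimal sketch using psycopg2 follows; the table, columns, and connection details are assumptions.

```python
import os
import psycopg2

# Connection details are placeholders; in practice they would come from
# Secrets Manager or an Airflow connection.
conn = psycopg2.connect(
    host="retailcorp.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)

# Assumed sales fact table: distribute on customer_id to co-locate joins with
# a customer dimension, sort on order_ts so time-range scans prune blocks.
DDL = """
CREATE TABLE IF NOT EXISTS fact_sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2),
    order_ts    TIMESTAMP
)
DISTKEY (customer_id)
SORTKEY (order_ts);
"""

with conn, conn.cursor() as cur:
    cur.execute(DDL)
```

The actual distribution and sort keys should be chosen from observed query patterns rather than copied from this sketch.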
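A minimal Airflow DAG skeleton for the ingest → quality-check → load flow is sketched below. The task names, hourly schedule, and failure callback are illustrative assumptions; the callback stands in for the alerting mechanism named above, and the placeholder functions would wrap the source extraction, the quality checks, and the Redshift COPY.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Placeholder alert hook; in practice this would page the on-call
    # channel (Slack, SNS, PagerDuty) with the failed task and run details.
    print(f"Task failed: {context['task_instance'].task_id}")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,
}

def ingest_from_sources(**_):
    # Pull incremental data from MySQL, REST APIs, and S3 into a raw/ prefix.
    ...

def run_quality_checks(**_):
    # Deduplicate, validate, and enforce schema (see clean_sales_batch above);
    # raise an exception to fail the task if quality thresholds are not met.
    ...

def load_to_redshift(**_):
    # COPY curated files from S3 into the optimized Redshift tables.
    ...

with DAG(
    dag_id="retailcorp_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # hourly micro-batches to meet the 1-hour latency target
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_from_sources",
                            python_callable=ingest_from_sources)
    quality = PythonOperator(task_id="run_quality_checks",
                             python_callable=run_quality_checks)
    load = PythonOperator(task_id="load_to_redshift",
                          python_callable=load_to_redshift)

    # Clear dependencies: ingest, then quality checks, then the warehouse load.
    ingest >> quality >> load
```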
Constraints
- Team: 5 data engineers with experience in Python and AWS.
- Infrastructure: AWS-based environment (Redshift, S3, Lambda).
- Budget: $15K/month for cloud services.