Context
DataCorp, a leading AI research firm, processes large datasets from various sources, including IoT sensors, social media, and transactional systems. The current batch-oriented ETL architecture struggles to keep up with growing demand for real-time data processing and analytics. To improve responsiveness and data quality, a new high-performance ETL pipeline is required.
Current Architecture
| Component | Technology | Issue |
|---|---|---|
| Data Sources | IoT devices, APIs, databases | 10TB data daily, high variability |
| Ingestion | Batch jobs every 24 hours | High latency, stale data |
| Processing | Apache Spark on EMR | Limited scalability, slow processing |
| Storage | Amazon S3 and Redshift | Complex data retrieval |
| Orchestration | Apache Airflow | Inefficient scheduling and dependency management |
Scale Requirements
- Throughput: Process 10TB of data daily, with peak loads reaching 1TB/hour.
- Latency target: Data should be available for analytics within 10 minutes of ingestion.
- Retention: Store raw data for 90 days and aggregated data indefinitely.
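As a sanity check on these targets (assuming decimal units, 1 TB = 10^12 bytes), the sustained rate works out to roughly 116 MB/s and the peak to roughly 278 MB/s:

```python
# Back-of-envelope throughput sizing from the stated scale requirements.
# Assumes decimal units (1 TB = 10**12 bytes); adjust if TiB is intended.

DAILY_BYTES = 10 * 10**12          # 10TB/day sustained
PEAK_HOURLY_BYTES = 1 * 10**12     # 1TB/hour at peak

sustained_mb_s = DAILY_BYTES / (24 * 3600) / 10**6
peak_mb_s = PEAK_HOURLY_BYTES / 3600 / 10**6

print(f"Sustained: {sustained_mb_s:.0f} MB/s")  # ~116 MB/s
print(f"Peak:      {peak_mb_s:.0f} MB/s")       # ~278 MB/s
```

Any candidate design therefore needs roughly 2.4x headroom over the average rate to absorb peak load.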
Requirements
- Design an ETL pipeline that sustains 1TB/hour at peak with real-time ingestion.
- Implement data quality checks, including schema validation and anomaly detection (see the ingestion-and-validation sketch after this list).
- Ensure data transformations are efficient, leveraging distributed processing.
- Integrate monitoring and alerting for pipeline health and data quality metrics (see the metrics sketch after this list).
- Maintain backward compatibility with existing batch systems during the transition phase.
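Given the team's Spark background, one plausible shape for the ingestion and validation stages is Spark Structured Streaming. The sketch below is a minimal illustration under stated assumptions, not a finished design: the Kafka endpoint, topic name, schema fields, and S3 bucket names are all hypothetical, and the job assumes the spark-sql-kafka connector is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("etl-ingest").getOrCreate()

# Expected record schema; from_json yields a null struct on malformed input,
# which lets us split valid records from quarantine candidates.
schema = StructType([
    StructField("sensor_id", StringType(), nullable=False),
    StructField("reading", DoubleType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical endpoint
       .option("subscribe", "events")                     # hypothetical topic
       .load())

parsed = raw.select(from_json(col("value").cast("string"), schema).alias("rec"))

valid = parsed.where(col("rec").isNotNull()).select("rec.*")
quarantine = parsed.where(col("rec").isNull())  # failed schema validation

# Land validated records in S3 in micro-batches to meet the 10-minute
# availability target; the quarantine stream would be written similarly.
query = (valid.writeStream
         .format("parquet")
         .option("path", "s3://datacorp-raw/events/")              # hypothetical bucket
         .option("checkpointLocation", "s3://datacorp-checkpoints/")  # hypothetical bucket
         .trigger(processingTime="1 minute")
         .start())
```

The quarantine stream also feeds the anomaly-detection requirement: a spike in its volume is itself a data-quality signal worth alerting on.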
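For the monitoring and alerting requirement, custom CloudWatch metrics are a low-friction option on an AWS-native stack. A hedged sketch using boto3's `put_metric_data` (the `DataCorp/ETL` namespace, metric names, and region are invented for illustration):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # hypothetical region

def publish_pipeline_metrics(records_processed: int, records_quarantined: int) -> None:
    """Push custom pipeline-health metrics to CloudWatch. A CloudWatch
    alarm on QuarantinedRecords can then page on-call when data quality
    degrades, satisfying the alerting requirement."""
    cloudwatch.put_metric_data(
        Namespace="DataCorp/ETL",  # hypothetical namespace
        MetricData=[
            {"MetricName": "RecordsProcessed",
             "Value": records_processed, "Unit": "Count"},
            {"MetricName": "QuarantinedRecords",
             "Value": records_quarantined, "Unit": "Count"},
        ],
    )
```

Each micro-batch of the streaming job would call this (e.g., from a `foreachBatch` hook) so pipeline health is observable at the same cadence as ingestion.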
Constraints
- Team: 5 data engineers with experience in Spark and AWS.
- Infrastructure: AWS-based with existing EMR, S3, and Redshift.
- Budget: $20K/month for cloud services, including compute and storage.