Context
DataGen Corp, a machine learning platform, processes large volumes of data daily to train models for predictive analytics. Its ETL jobs currently run in a monolithic architecture, which creates bottlenecks and slow processing times. The engineering team aims to redesign the ETL pipeline around distributed computing libraries, using MPI for general message passing across nodes (or NCCL, where pipeline stages run on GPUs), to increase throughput and reduce latency.
Scale Requirements
- Data Volume: 5TB of data ingested daily, with bursts of up to 500GB during model-training runs.
- Throughput: Must support processing 10 million records per minute.
- Latency: Targeting data availability within 10 minutes of ingestion.
- Storage: Use a data lake architecture for raw and processed data, with a 90-day retention policy.
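A quick consistency check on these figures (a back-of-envelope sketch, assuming roughly uniform record sizes and a sustained ingest rate):

```python
# Back-of-envelope sizing implied by the stated targets.
DAILY_BYTES = 5 * 10**12            # 5TB/day
RECORDS_PER_MINUTE = 10_000_000     # throughput target

records_per_day = RECORDS_PER_MINUTE * 60 * 24      # 14.4 billion/day
avg_record_bytes = DAILY_BYTES / records_per_day    # ~347 bytes/record
print(f"~{avg_record_bytes:.0f} bytes per record")
```

At roughly 350 bytes per record the volume and throughput targets are mutually consistent; per-node processing rate then determines how many parallel workers the 10-minute latency window requires.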
Functional Requirements
- Implement a distributed ETL pipeline using MPI (or NCCL for GPU-resident stages) for parallel processing; a minimal worker sketch covering extraction, validation, and transformation follows this list.
- Ensure data validation and quality checks during the extraction phase.
- Transform data into a format suitable for machine learning (e.g., feature engineering).
- Load processed data into a data lake (e.g., AWS S3) with metadata cataloging; see the load sketch after this list.
- Integrate orchestration tools (e.g., Apache Airflow) for scheduling and monitoring; an example DAG follows this list.
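The sketch below is a minimal illustration of the parallel extract/validate/transform step using mpi4py: rank 0 reads an ingestion batch and scatters it across ranks, each rank filters and transforms its shard, and results are gathered back. The record schema, validation rules, and feature logic are placeholder assumptions, and the synthetic input stands in for the real ingestion source.

```python
import math
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def validate(rec):
    """Quality check: required fields present, amount non-negative (assumed rules)."""
    return rec.get("id") is not None and rec.get("amount", -1) >= 0

def transform(rec):
    """Placeholder feature engineering: log-scale the amount field."""
    rec["amount_log"] = math.log1p(rec["amount"])
    return rec

if rank == 0:
    # In production this would read a batch from the ingestion source;
    # synthetic records keep the sketch self-contained.
    records = [{"id": i, "amount": float(i % 100)} for i in range(1000)]
    chunks = [records[i::size] for i in range(size)]
else:
    chunks = None

local = comm.scatter(chunks, root=0)      # distribute one shard per rank
clean = [transform(r) for r in local if validate(r)]
gathered = comm.gather(clean, root=0)     # collect transformed shards

if rank == 0:
    flat = [r for shard in gathered for r in shard]
    print(f"{len(flat)} records validated and transformed across {size} ranks")
```

Launched with, e.g., `mpirun -np 8 python etl_mpi.py`, each rank processes its shard independently; swapping the synthetic batch for reads from the real source leaves the collective structure unchanged.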
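For the load step, one approach is sketched below: upload a processed Parquet shard to S3 with boto3, then register the corresponding partition in the AWS Glue Data Catalog so downstream jobs can discover it. The bucket, database, and table names and the date-partitioned layout are illustrative assumptions, and the Glue table is assumed to already exist with a matching schema.

```python
import boto3

# Hypothetical names -- adjust to the actual lake layout.
BUCKET = "datagen-lake"
DATABASE = "analytics"
TABLE = "features"

def load_shard(local_path: str, ds: str, shard: int) -> None:
    """Upload one processed shard to S3 and catalog its partition in Glue."""
    key = f"processed/ds={ds}/part-{shard:05d}.parquet"
    s3 = boto3.client("s3")
    s3.upload_file(local_path, BUCKET, key)

    glue = boto3.client("glue")
    try:
        glue.create_partition(
            DatabaseName=DATABASE,
            TableName=TABLE,
            PartitionInput={
                "Values": [ds],
                "StorageDescriptor": {
                    "Location": f"s3://{BUCKET}/processed/ds={ds}/",
                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                    },
                },
            },
        )
    except glue.exceptions.AlreadyExistsException:
        pass  # later shards for the same date reuse the existing partition
```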
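Finally, a small Airflow DAG can schedule the pipeline end to end: it launches the MPI extract/transform step, then runs the load step, every ten minutes to match the latency target. The DAG id, file paths, and process count are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datagen_etl",                   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/10 * * * *",       # every 10 min, per the latency target
    catchup=False,
) as dag:
    # Launch the MPI extract/validate/transform step (8 ranks, illustrative).
    extract_transform = BashOperator(
        task_id="mpi_extract_transform",
        bash_command="mpirun -np 8 python /opt/etl/etl_mpi.py",
    )
    # Load the resulting shards into S3 and update the Glue catalog.
    load = BashOperator(
        task_id="load_to_lake",
        bash_command="python /opt/etl/load_shards.py",
    )
    extract_transform >> load
```

Airflow's scheduler and web UI then cover the monitoring requirement: task retries, run history, and failure alerting come with the orchestrator rather than needing custom tooling.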
Constraints
- Infrastructure: Must run on existing AWS resources (EC2, S3).
- Budget: Limited to a $15K monthly operational cost.
- Compliance: Must adhere to data privacy regulations (GDPR, CCPA).