Context
DataAI, a machine learning platform, processes large volumes of data daily to train models. Its data annotation process is currently manual and inconsistent, causing quality issues and delays. To improve model accuracy and shorten training cycles, the VP of Data Engineering wants an automated ETL pipeline that ingests raw data, applies the necessary transformations, and outputs annotated data ready for machine learning.
Scale Requirements
- Data Volume: Process 1 million records daily, with potential growth to 5 million.
- Throughput: Ensure the pipeline can handle 50,000 records per hour.
- Latency: Data should be available for analysis within 2 hours of ingestion.
- Storage: Store raw and annotated data in a data lake (e.g., S3) with a retention policy of 30 days for raw data.
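A quick sanity check on these targets (the figures are taken directly from the numbers above) shows the stated throughput floor covers today's volume but not the growth scenario:

```python
# Sanity-check the throughput target against the stated daily volumes.
RECORDS_PER_DAY_NOW = 1_000_000
RECORDS_PER_DAY_GROWTH = 5_000_000
TARGET_PER_HOUR = 50_000

required_now = RECORDS_PER_DAY_NOW / 24        # ~41,667 records/hour
required_growth = RECORDS_PER_DAY_GROWTH / 24  # ~208,333 records/hour

print(f"Current load: {required_now:,.0f} records/hour")
print(f"Growth load:  {required_growth:,.0f} records/hour")
print(f"50K/hour covers growth: {TARGET_PER_HOUR >= required_growth}")
```

In other words, 50,000 records/hour comfortably absorbs 1 million records/day, but the 5-million-record growth target implies roughly 208,000 records/hour, so the pipeline should be designed to scale horizontally well beyond the stated floor.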
Requirements
- Ingest data from various sources (CSV, JSON) into a staging area.
- Implement data quality checks (e.g., schema validation, completeness checks).
- Annotate data using pre-defined rules and machine learning models.
- Store processed data in a structured format (Parquet) for efficient querying.
- Schedule and orchestrate the pipeline using Apache Airflow with retry mechanisms.
- Monitor data quality and pipeline health with alerting mechanisms.
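A minimal sketch of the schema-validation and completeness checks in the second requirement, assuming records arrive as Python dicts after CSV/JSON ingestion (the field names and types below are hypothetical, not a real DataAI schema):

```python
# Minimal data quality checks: per-record schema validation plus a
# dataset-level completeness ratio. Field names are illustrative only.
EXPECTED_SCHEMA = {"id": int, "text": str, "source": str}
REQUIRED_FIELDS = ("id", "text")

def validate_record(record: dict) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"empty required field: {field}")
    return errors

def completeness(records: list[dict]) -> float:
    """Fraction of records with no violations; alert when it drops below a threshold."""
    if not records:
        return 1.0
    passing = sum(1 for r in records if not validate_record(r))
    return passing / len(records)
```

Checks like these can run in the staging area before annotation, with the completeness ratio exported as a metric for the monitoring requirement; a framework such as Great Expectations could replace this hand-rolled version once the team gains experience.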
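The rule-based half of the annotation step could start as simply as an ordered list of predicate/label pairs applied per record before any ML model runs (the rules and labels here are invented for illustration):

```python
# Rule-based annotation: each rule is a (predicate, label) pair and the first
# matching rule wins; unmatched records fall through to the ML annotator.
RULES = [
    (lambda r: len(r.get("text", "")) == 0, "empty"),
    (lambda r: "error" in r.get("text", "").lower(), "error_report"),
]
DEFAULT_LABEL = "unlabeled"  # handed to the model-based annotator downstream

def annotate(record: dict) -> dict:
    for predicate, label in RULES:
        if predicate(record):
            return {**record, "label": label}
    return {**record, "label": DEFAULT_LABEL}
```

Keeping the rules as plain data makes them easy to review and version alongside the pipeline code.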
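Airflow provides the scheduling and retry behavior natively (via the `retries` and `retry_delay` entries in a DAG's `default_args`); as a standalone, stdlib-only illustration of the retry semantics the orchestration requirement asks for:

```python
import time

def run_with_retries(task, max_retries: int = 3, retry_delay: float = 0.0):
    """Run `task` up to max_retries + 1 times, mirroring Airflow's
    `retries`/`retry_delay` task parameters; re-raise the last error
    once attempts are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(retry_delay)  # Airflow would also log and emit alerts here
```

In the real pipeline this logic lives in Airflow's scheduler rather than application code; the sketch only shows the behavior the team should configure per task.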
Constraints
- Team: Three data engineers with limited experience in data quality frameworks.
- Infrastructure: AWS-based with existing S3, Redshift, and Airflow.
- Budget: Limited to $20K/month for additional resources and tools.
- Compliance: Ensure adherence to data privacy regulations (GDPR).