Context
DataCorp, a data analytics company, has observed that the build times of its ETL pipeline have doubled over the last month, hurting data freshness and downstream reporting. The pipeline processes data from various sources (CSV files, APIs) into a Snowflake data warehouse, with orchestration managed by Apache Airflow. The slowdown has raised concerns about data quality and operational efficiency.
Scale Requirements
- Current Throughput: 2 TB of data processed daily, targeting a build time of under 2 hours.
- Data Volume: Approximately 200 files per day, averaging 10 GB each.
- Latency Target: Data should be available in Snowflake within 2 hours after extraction.
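Taken together, these figures imply a sustained end-to-end rate of roughly 1 TB per hour (about 280 MB/s) across the 2-hour window.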
Requirements
- Analyze the ETL pipeline's performance metrics to identify bottlenecks.
- Implement monitoring strategies to track data quality (e.g., completeness, accuracy) and processing times; a quality-check sketch follows this list.
- Optimize data transformation steps to reduce processing time without sacrificing data integrity; see the chunked-transform sketch after this list.
- Ensure Apache Airflow orchestration stays efficient, with minimal task dependencies and a schedule matched to the latency target; see the DAG sketch after this list.
- Develop a rollback plan for any changes made to the ETL process.
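The sketch below illustrates one way to satisfy the monitoring requirement on the existing EC2 workers. It is a minimal example, assuming transformed batches can be read into pandas; the column names, thresholds, and logging sink are illustrative placeholders, not part of the current pipeline.

```python
# Hypothetical data-quality check for one processed batch. Column names,
# thresholds, and the metrics sink are illustrative assumptions.
import logging
import time

import pandas as pd

REQUIRED_COLUMNS = ["event_id", "event_ts", "source", "amount"]  # assumed schema

def check_batch_quality(df: pd.DataFrame, expected_rows: int) -> dict:
    """Compute simple completeness/accuracy metrics for a transformed batch."""
    metrics = {
        "row_count": len(df),
        "completeness": len(df) / expected_rows if expected_rows else 0.0,
        "null_fraction": float(df[REQUIRED_COLUMNS].isna().mean().mean()),
        "duplicate_ids": int(df["event_id"].duplicated().sum()),
        "checked_at": time.time(),
    }
    # Flag batches that fall below the assumed thresholds; the orchestrator can
    # then fail the task or route the batch to a quarantine location in S3.
    metrics["passed"] = (
        metrics["completeness"] >= 0.99
        and metrics["null_fraction"] <= 0.01
        and metrics["duplicate_ids"] == 0
    )
    logging.info("batch quality metrics: %s", metrics)
    return metrics
```

Run this as a check task immediately after each transform so a failing batch blocks the Snowflake load instead of silently degrading data quality, and log the same metrics over time to track processing trends.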
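For the transformation requirement, the main levers available under the existing EC2/S3 constraint are streaming large files in chunks (so a 10 GB CSV never has to fit in memory at once) and processing independent files in parallel. The following sketch assumes a reasonably recent pandas and inputs on local or mounted storage; the chunk size, the assumed "event_ts" column, and the transform body are placeholders.

```python
# Sketch: chunked, parallel transformation on an existing EC2 worker.
# Paths, chunk size, and the placeholder transform are assumptions.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd

CHUNK_ROWS = 1_000_000  # tune to the instance's memory

def transform_file(src: Path, dst_dir: Path) -> list[Path]:
    """Stream one large CSV in chunks and emit gzip part files, which
    Snowflake's COPY INTO can later load in parallel."""
    parts: list[Path] = []
    with pd.read_csv(src, chunksize=CHUNK_ROWS) as reader:
        for i, chunk in enumerate(reader):
            # Placeholder transform; the real business logic goes here.
            chunk["event_ts"] = pd.to_datetime(chunk["event_ts"], utc=True)
            part = dst_dir / f"{src.stem}_part{i:04d}.csv.gz"
            chunk.to_csv(part, index=False, compression="gzip")
            parts.append(part)
    return parts

def transform_all(files: list[Path], dst_dir: Path, workers: int = 8) -> list[Path]:
    """Run independent files in separate processes; files do not depend on
    each other, so the parallelism does not change results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform_file, files, [dst_dir] * len(files)))
    return [part for parts in results for part in parts]
```

Emitting many moderately sized compressed parts instead of one large output also lets Snowflake parallelize the COPY INTO across files.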
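For the orchestration requirement, a shape that keeps dependencies minimal is one mapped extract/transform task per file, a single fan-in load, and one quality-check task at the end. The sketch below assumes a recent Airflow 2.x with the TaskFlow API and dynamic task mapping; the schedule, bucket names, and task bodies are placeholders.

```python
# Hypothetical DAG layout: per-file extract/transform tasks fan in to a single
# Snowflake load, followed by one quality check. Schedule and names are assumed.
from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(
    schedule="0 */2 * * *",             # assumed: align runs with the 2-hour target
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,                  # avoid overlapping runs competing for EC2 capacity
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def etl_pipeline():
    @task
    def list_new_files() -> list[str]:
        # Would list unprocessed objects in the existing S3 landing bucket.
        return ["s3://datacorp-landing/example.csv"]  # placeholder

    @task
    def transform(path: str) -> str:
        # Chunked transform (see the sketch above); returns the staged S3 prefix.
        return path.replace("landing", "staged")

    @task
    def load_to_snowflake(prefixes: list[str]) -> None:
        # One COPY INTO over all staged prefixes keeps the dependency graph flat.
        pass

    @task
    def run_quality_checks() -> None:
        # Row-count and completeness checks against the loaded tables.
        pass

    staged = transform.expand(path=list_new_files())
    load_to_snowflake(staged) >> run_quality_checks()

etl_pipeline()
```

Settings such as max_active_runs and the retry policy address scheduling efficiency using only built-in Airflow configuration, which respects the no-new-tools budget constraint.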
Constraints
- Infrastructure: Limited to existing AWS resources (EC2 for processing, S3 for storage).
- Budget: No additional budget for new tools or services.
- Compliance: Must adhere to data governance policies, including data retention and user privacy regulations.