Context
DataCorp, a financial services company, processes large volumes of transactional data from various sources, including APIs, databases, and external feeds. The current ETL pipeline, built on Apache Airflow and PostgreSQL, struggles to handle the data volume, which has grown to 10TB per day, leading to increased latency and operational costs. The data engineering team is tasked with redesigning the pipeline to improve scalability and bring end-to-end latency down to 30 minutes or less.
Scale Requirements
- Throughput: 10TB of data daily, approximately 115 MB/sec
- Latency target: Data should be available for querying within 30 minutes of arrival
- Retention: Raw data for 90 days, aggregated summaries indefinitely
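The scale numbers above can be sanity-checked with quick arithmetic; the sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which matches the cited ~115 MB/sec figure.

```python
# Back-of-the-envelope check of the stated scale requirements.
DAILY_BYTES = 10 * 10**12      # 10TB ingested per day
SECONDS_PER_DAY = 24 * 60 * 60

# Sustained ingest rate needed to keep up with daily volume
sustained_mb_per_sec = DAILY_BYTES / SECONDS_PER_DAY / 10**6

# Raw-data retention footprint: 90 days at 10TB/day
raw_retention_tb = 10 * 90

print(f"Sustained ingest: {sustained_mb_per_sec:.1f} MB/s")   # ~115.7 MB/s
print(f"Raw retention footprint: {raw_retention_tb} TB")      # 900 TB
```

Note the storage implication: the 90-day raw retention alone requires on the order of 900TB, which matters for the budget constraint below.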
Requirements
- Design an ETL architecture that scales to handle 10TB of daily data with the specified latency.
- Implement data quality checks to ensure accuracy and completeness of data.
- Utilize a data warehouse (e.g., Snowflake) for efficient storage and querying.
- Provide monitoring and alerting mechanisms to ensure pipeline health.
- Ensure the solution can handle schema changes without downtime.
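As a minimal illustration of the data quality requirement, the sketch below validates a batch of transaction records for completeness (no missing required fields) and a simple accuracy rule (non-negative amounts). The field names (`txn_id`, `amount`, `ts`) are hypothetical; a real pipeline would likely use a dedicated framework and domain-specific rules.

```python
# Sketch of a row-level data quality gate; field names are illustrative.
REQUIRED_FIELDS = ("txn_id", "amount", "ts")

def check_batch(rows):
    """Split a batch into rows that pass and rows that fail quality checks."""
    passed, failed = [], []
    for row in rows:
        # Completeness: every required field must be present and non-null
        complete = all(row.get(f) is not None for f in REQUIRED_FIELDS)
        # Accuracy: amounts must be numeric and non-negative
        accurate = complete and isinstance(row["amount"], (int, float)) and row["amount"] >= 0
        (passed if complete and accurate else failed).append(row)
    return passed, failed

batch = [
    {"txn_id": "t1", "amount": 10.5, "ts": "2024-01-01T00:00:00Z"},
    {"txn_id": "t2", "amount": None, "ts": "2024-01-01T00:00:01Z"},  # incomplete
    {"txn_id": "t3", "amount": -5.0, "ts": "2024-01-01T00:00:02Z"},  # inaccurate
]
passed, failed = check_batch(batch)
```

Failed rows would typically be routed to a quarantine location and surfaced through the monitoring and alerting mechanisms rather than silently dropped.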
Constraints
- Infrastructure: Must operate on existing AWS infrastructure (EC2, S3, RDS).
- Budget: Limited to $20K/month for additional cloud resources.
- Compliance: Must adhere to financial regulations regarding data retention and security.
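One way to enforce the 90-day raw retention on the existing S3 infrastructure is an S3 lifecycle rule; the sketch below only builds the configuration dictionary (bucket and prefix names are hypothetical), which in practice would be applied via boto3's `put_bucket_lifecycle_configuration`.

```python
# Sketch of an S3 lifecycle rule for the 90-day raw-data retention
# requirement; the "raw/" prefix is an assumed layout convention.
RAW_RETENTION_DAYS = 90

lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-raw-after-90-days",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            # Objects under raw/ are deleted once they exceed the retention window
            "Expiration": {"Days": RAW_RETENTION_DAYS},
        }
    ]
}
```

Aggregated summaries, which are retained indefinitely, would live under a separate prefix not covered by this rule; any lifecycle policy should be reviewed against the applicable financial retention regulations before being enabled.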