Context
DataCorp, a financial services company, ingests large volumes of transactional data daily from sources such as APIs, databases, and flat files. Its ETL jobs currently run nightly, but frequent job failures cause data inconsistencies and delayed reporting. The VP of Data Engineering wants the ETL pipeline redesigned to ensure robust error handling and data consistency, and to add real-time data processing capabilities.
Scale Requirements
- Throughput: Process 10 million records daily, with a peak of 1,000 records per second during batch runs.
- Latency: Ensure that data is available for reporting within 30 minutes of extraction.
- Storage: Store processed data in Snowflake, with an expected growth of 10TB per year.
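As a quick sanity check on the numbers above: 10 million records spread evenly across a day implies a sustained average rate far below the stated 1,000 records-per-second peak, and the peak rate bounds how short the batch window can be.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_records = 10_000_000
peak_rate = 1_000  # records/second during batch runs

# Average sustained rate if the daily load were spread evenly.
avg_rate = daily_records / SECONDS_PER_DAY  # ~116 records/second

# Shortest batch window that can move the whole day's volume at peak rate.
min_batch_window_hours = daily_records / peak_rate / 3600  # ~2.8 hours
```

The roughly 9x gap between the average and peak rates suggests most of the daily volume is expected to arrive in concentrated batch windows rather than as a steady stream.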
Requirements
- Design an ETL pipeline that extracts data from multiple sources (APIs, PostgreSQL, and CSV files).
- Implement error handling strategies that log failures, retry jobs, and ensure data consistency.
- Ensure idempotency of ETL jobs to avoid duplicate records in Snowflake.
- Create monitoring dashboards to visualize job status, error rates, and data quality metrics.
- Implement data validation checks before loading into Snowflake, including schema validation and data type checks.
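The retry-and-log requirement can be sketched as a small decorator wrapped around each job step. The `max_attempts` and `base_delay` values and the exponential backoff policy are illustrative choices, not part of the brief:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

def with_retries(max_attempts=3, base_delay=1.0):
    """Retry a job step with exponential backoff, logging every failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    logger.warning("step %s failed (attempt %d/%d): %s",
                                   func.__name__, attempt, max_attempts, exc)
                    if attempt == max_attempts:
                        raise  # surface the failure so the job is marked failed
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Re-raising on the final attempt matters: a swallowed exception would let a partially loaded batch look successful, which is exactly the inconsistency the redesign is meant to eliminate.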
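Idempotency in Snowflake is commonly achieved by loading each batch into a staging table and issuing a MERGE keyed on a business identifier, so a re-run of the same job updates existing rows instead of inserting duplicates. A sketch that builds such a statement follows; the table and column names in the usage are hypothetical:

```python
def build_merge_sql(target, staging, key, columns):
    """Build a Snowflake MERGE statement so reloaded batches upsert
    on the business key rather than creating duplicate rows."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    col_list = ", ".join(columns)
    src_list = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t USING {staging} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({key}, {col_list}) "
        f"VALUES (s.{key}, {src_list})"
    )

# Example: upsert a staged batch of transactions keyed on txn_id.
sql = build_merge_sql("transactions", "transactions_stg",
                      "txn_id", ["amount", "status"])
```

In production these identifiers would come from configuration, and the statement would be executed through the Snowflake connector inside the same transaction that truncates the staging table.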
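The pre-load validation requirement (schema and data type checks) can start as simply as checking each record against an expected column-to-type map before it reaches the Snowflake load step. The schema below is a hypothetical example, not the real transaction layout:

```python
EXPECTED_SCHEMA = {  # hypothetical transaction schema
    "txn_id": str,
    "amount": float,
    "currency": str,
}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for column, expected_type in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors
```

Returning a list of errors, rather than raising on the first failure, lets the pipeline route bad records to a quarantine table and feed error counts into the monitoring dashboards rather than failing the whole batch.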
Constraints
- Team: 3 data engineers with limited experience in streaming technologies.
- Infrastructure: Existing AWS stack with limited budget for additional tools.
- Compliance: Must adhere to financial regulations and data privacy laws.