Context
DataCorp, a financial services provider, aggregates data from transactional databases, third-party APIs, and CSV files supplied by partners. The current integration process is manual and error-prone, causing delays and inconsistencies in reporting. The goal is to automate the ETL process so the business intelligence team gets timely, accurate analytics.
Current Architecture
| Component | Technology | Issue |
|---|---|---|
| Data Sources | MySQL, REST APIs, CSV files | Manual extraction and transformation |
| Ingestion | Custom scripts (Python) | High maintenance overhead |
| Processing | Batch jobs (Airflow) | Long processing times and lack of data validation |
| Storage | Snowflake | Data inconsistencies and outdated information |
Scale Requirements
- Data Volume: 10 million MySQL records, 1 million API records, and 500 GB of CSV data per month.
- Latency Target: Data should be available in Snowflake within 2 hours of source updates.
- Retention: Raw data for 30 days, aggregated data indefinitely.
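Assuming the monthly volumes above are spread roughly evenly over a 30-day month, a back-of-envelope calculation (the per-day and per-run figures below are derived estimates, not from the brief) suggests the daily loads are modest and the 2-hour latency target can be met with intra-day incremental batches rather than streaming:

```python
# Back-of-envelope sizing from the stated monthly volumes.
# Assumption: an even ~30-day distribution; real load may be bursty.
MYSQL_RECORDS_PER_MONTH = 10_000_000
API_RECORDS_PER_MONTH = 1_000_000
CSV_GB_PER_MONTH = 500
DAYS_PER_MONTH = 30

mysql_per_day = MYSQL_RECORDS_PER_MONTH // DAYS_PER_MONTH  # ~333K records/day
api_per_day = API_RECORDS_PER_MONTH // DAYS_PER_MONTH      # ~33K records/day
csv_gb_per_day = CSV_GB_PER_MONTH / DAYS_PER_MONTH         # ~16.7 GB/day

# A 2-hour freshness target implies runs at least every 2 hours; with hourly
# runs, each MySQL extract handles only a small incremental slice:
mysql_per_hourly_run = mysql_per_day // 24                 # ~13.9K records/run

print(mysql_per_day, api_per_day, round(csv_gb_per_day, 1), mysql_per_hourly_run)
```

These volumes are well within what scheduled Airflow batch jobs can handle, which matters given the team's limited streaming experience.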
Requirements
- Design an ETL pipeline that extracts data from MySQL, APIs, and CSV sources.
- Implement data transformation rules to clean and normalize data.
- Enforce data quality checks (e.g., schema validation, duplicate removal).
- Load processed data into Snowflake in a structured format.
- Set up orchestration using Apache Airflow for scheduling and monitoring.
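The quality checks called out above (schema validation, duplicate removal) can be sketched in plain Python as a pre-load transformation step. The field names, types, and the `record_id` dedup key below are illustrative assumptions, not DataCorp's actual schema:

```python
# Minimal sketch of the pre-load quality checks: schema validation + dedup.
# SCHEMA and the record_id key are assumptions for illustration only.
SCHEMA = {"record_id": str, "amount": float, "currency": str}

def validate(record: dict) -> bool:
    """True if the record has every required field with the expected type."""
    return all(
        field in record and isinstance(record[field], ftype)
        for field, ftype in SCHEMA.items()
    )

def clean(records: list[dict]) -> list[dict]:
    """Drop schema-invalid records, then deduplicate on record_id (first wins)."""
    seen, out = set(), []
    for rec in records:
        if not validate(rec):
            continue  # in production: route to a quarantine table for review
        if rec["record_id"] in seen:
            continue
        seen.add(rec["record_id"])
        out.append(rec)
    return out

raw = [
    {"record_id": "a1", "amount": 10.0, "currency": "USD"},
    {"record_id": "a1", "amount": 10.0, "currency": "USD"},  # duplicate
    {"record_id": "b2", "amount": "bad", "currency": "EUR"},  # wrong type
]
print(clean(raw))  # only the first a1 record survives
```

In the pipeline itself, this logic would run as a transformation task between extraction and the Snowflake load, so invalid rows never reach the warehouse.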
Constraints
- Team: 3 data engineers with limited experience in streaming technologies.
- Infrastructure: AWS environment with existing Snowflake setup.
- Budget: $15K/month for cloud resources and tools.