Context
FinTech Corp, a financial services company, runs a hybrid infrastructure of bare metal servers and VMware virtual machines. Its current ETL process relies on manual data validation and manual scheduling, which leads to inconsistent data and delayed reporting. The company wants to automate both steps while ensuring data quality across the two environments.
Scale Requirements
- Data Volume: 10 TB processed daily from sources such as transaction logs and user activity.
- Latency: ETL jobs must complete within 2 hours to support near real-time reporting.
- Concurrency: Support 100 simultaneous ETL jobs without performance degradation.
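To make these targets concrete, the figures above can be turned into rough throughput numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal units (1 TB = 10^12 bytes) and, as a worst case that the brief does not state, that the full daily volume has to pass through a single 2-hour window shared evenly by 100 jobs.

```python
# Back-of-the-envelope throughput implied by the scale requirements.
# Assumptions (not stated in the brief): decimal units (1 TB = 10**12 bytes),
# and a worst case where the whole daily volume lands in one 2-hour window.

DAILY_BYTES = 10 * 10**12        # 10 TB per day
SECONDS_PER_DAY = 24 * 60 * 60
WINDOW_SECONDS = 2 * 60 * 60     # 2-hour completion target
CONCURRENT_JOBS = 100            # simultaneous ETL jobs


def mb_per_second(num_bytes: int, seconds: int) -> float:
    """Average throughput in MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / seconds / 10**6


sustained = mb_per_second(DAILY_BYTES, SECONDS_PER_DAY)
burst = mb_per_second(DAILY_BYTES, WINDOW_SECONDS)
per_job = burst / CONCURRENT_JOBS

print(f"Spread over 24 hours:     {sustained:,.1f} MB/s")
print(f"Within one 2-hour window: {burst:,.1f} MB/s")
print(f"Per job, 100 in parallel: {per_job:,.1f} MB/s")
```

Under these assumptions the pipeline needs roughly 116 MB/s averaged over the day, or about 1.4 GB/s aggregate (about 14 MB/s per job) if everything must finish inside one 2-hour window. Real workloads will fall somewhere between the two, depending on how arrivals are spread across the day.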
Requirements
- Design an ETL pipeline that integrates data from both the bare metal and the virtualized environments.
- Implement automated data quality checks (schema validation, completeness checks) to ensure data integrity.
- Schedule and orchestrate ETL jobs using a tool that can handle dependencies and retries.
- Ensure that the pipeline can scale horizontally to accommodate increasing data volumes.
- Provide detailed logging and monitoring of ETL job statuses and performance metrics.
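The data quality and retry requirements above can be illustrated with a small sketch. It is not tied to any particular orchestration tool, since the brief names none, and it only shows the shape of the checks. The schema, the field names (`txn_id`, `amount`, `ts`), and the helper names `check_batch` and `run_with_retries` are hypothetical and chosen for illustration.

```python
from __future__ import annotations

import time
from typing import Any, Callable

# Hypothetical schema for a transaction-log record: column name -> expected type.
EXPECTED_SCHEMA: dict[str, type] = {"txn_id": str, "amount": float, "ts": str}


def check_batch(rows: list[dict[str, Any]], expected_rows: int) -> list[str]:
    """Return a list of data quality problems; an empty list means the batch passed."""
    problems: list[str] = []
    # Completeness: the batch should hold as many rows as the source reported.
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for column, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(column)
            if value is None:
                # Completeness: a required field is missing or null.
                problems.append(f"row {i}: missing {column}")
            elif not isinstance(value, expected_type):
                # Schema validation: the field has the wrong type.
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def run_with_retries(
    job: Callable[[], Any], max_attempts: int = 3, delay_seconds: float = 0.0
) -> Any:
    """Run job, retrying on any exception; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


good = {"txn_id": "t-1", "amount": 12.5, "ts": "2024-01-01T00:00:00Z"}
bad = {"txn_id": "t-2", "amount": "12.5", "ts": None}
problems = check_batch([good, bad], expected_rows=3)
for problem in problems:
    print(problem)
```

In a real deployment the retry loop would normally be delegated to the chosen scheduler, and the problems list would feed the logging and monitoring described in the last requirement instead of being printed.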
Constraints
- Infrastructure: No new hardware; the pipeline must run on the existing bare metal servers and VMware infrastructure.
- Budget: Monthly operational cost must not exceed $10,000.
- Compliance: Must adhere to financial regulations regarding data handling and retention.
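Combining the budget ceiling with the daily data volume gives a rough cost target per terabyte. This is simple arithmetic, not a cost model. It assumes a 30-day month and that the whole operational budget is attributed to processing the stated 10 TB per day.

```python
# Rough cost ceiling per terabyte processed.
# Assumption (not in the brief): a 30-day month.
MONTHLY_BUDGET_USD = 10_000
DAILY_TB = 10
DAYS_PER_MONTH = 30

monthly_tb = DAILY_TB * DAYS_PER_MONTH
budget_per_tb = MONTHLY_BUDGET_USD / monthly_tb

print(f"{monthly_tb} TB/month -> ${budget_per_tb:.2f} per TB processed")
```

Under these assumptions the pipeline has about $33 of operational spend available per terabyte, which is a useful sanity check when comparing tooling options.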