Context
DataCorp, a financial analytics firm, processes large volumes of transactional data from multiple sources, including APIs, databases, and flat files. The current ETL pipeline, orchestrated with Apache Airflow, has grown increasingly complex, causing brittle job dependencies and data-quality problems. The VP of Data Engineering has tasked the team with redesigning the pipeline to improve dependency management and ensure data quality across all transformations.
Scale Requirements
- Data Volume: 10 TB of data ingested daily from 5 different sources.
- Batch Size: Each job processes 1 million records at an average record size of 1 KB (roughly 1 GB per batch).
- Latency Target: Jobs should complete within 2 hours to meet reporting deadlines.
- Concurrency: Support for 50 concurrent jobs without performance degradation.
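A quick back-of-envelope check of these figures clarifies the real per-job budget (decimal units assumed throughout: 1 KB = 10^3 bytes, 1 TB = 10^12 bytes):

```python
RECORDS_PER_JOB = 1_000_000
RECORD_SIZE_B = 1_000          # 1 KB average record
DAILY_VOLUME_B = 10 * 10**12   # 10 TB ingested daily
CONCURRENCY = 50

job_bytes = RECORDS_PER_JOB * RECORD_SIZE_B    # 1e9 bytes: ~1 GB per batch
jobs_per_day = DAILY_VOLUME_B / job_bytes      # 10,000 jobs/day to keep up

# With 50 jobs running at once, each job must finish, on average, within:
avg_job_seconds = 86_400 * CONCURRENCY / jobs_per_day  # 432 s (~7 minutes)

print(job_bytes, jobs_per_day, avg_job_seconds)
```

So while any single job is allowed up to 2 hours, the average job must complete in about 7 minutes for the fleet to keep pace with daily volume.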
Requirements
- Implement a dependency management system that tracks job prerequisites and execution order.
- Include data quality checks at each stage of the ETL process, such as schema validation and anomaly detection.
- Ensure idempotency in data processing to avoid duplicate records during retries or failures.
- Provide detailed logging and monitoring for each job, including success/failure metrics and execution times.
- Design a rollback mechanism for failed jobs to maintain data integrity.
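The dependency-management requirement can be sketched, independently of any particular orchestrator, with Python's standard-library topological sorter; the job names and graph below are illustrative, not DataCorp's actual DAG:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOB_DEPS = {
    "extract_api": set(),
    "extract_db": set(),
    "validate": {"extract_api", "extract_db"},  # quality gate before transform
    "transform": {"validate"},
    "load": {"transform"},
}

def execution_order(deps):
    """Return a valid run order; raises graphlib.CycleError on circular deps."""
    return list(TopologicalSorter(deps).static_order())

print(execution_order(JOB_DEPS))
```

The same graph structure can drive Airflow task dependencies; keeping it as data makes prerequisites auditable and cycle detection automatic.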
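Per-stage quality checks can be sketched as two small functions, one for schema validation and one for simple z-score anomaly detection; the schema and field names are placeholders:

```python
import statistics

# Illustrative schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"txn_id": str, "amount": float, "currency": str}

def schema_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations for one record (empty = valid)."""
    errors = [f"missing field: {f}" for f in schema if f not in record]
    errors += [
        f"wrong type for {f}: expected {t.__name__}"
        for f, t in schema.items()
        if f in record and not isinstance(record[f], t)
    ]
    return errors

def anomalous_amounts(amounts, z_threshold=3.0):
    """Flag amounts more than z_threshold standard deviations from the mean."""
    mean = statistics.fmean(amounts)
    stdev = statistics.stdev(amounts)
    return [a for a in amounts if stdev and abs(a - mean) / stdev > z_threshold]
```

In the pipeline, records failing either check would be routed to a quarantine location rather than silently dropped, preserving auditability.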
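Idempotency can be achieved by deriving a deterministic key from business fields, so a retried batch writes the same keys and duplicates are skipped. A minimal in-memory sketch (in production the seen-set would be a keyed store or a database upsert; the field names are assumptions):

```python
import hashlib

def record_key(record):
    # Deterministic key from business fields: retries yield the same key.
    raw = f"{record['source']}|{record['txn_id']}".encode()
    return hashlib.sha256(raw).hexdigest()

class IdempotentSink:
    def __init__(self):
        self._seen = set()  # stand-in for a persistent keyed store / upsert
        self.rows = []

    def write(self, record):
        key = record_key(record)
        if key in self._seen:
            return False    # duplicate from a retry or replay; skip it
        self._seen.add(key)
        self.rows.append(record)
        return True
```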
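The logging/monitoring requirement can be sketched as a decorator that records success/failure and execution time for every job run; the logger name and message format are placeholders, and real deployments would also ship these as metrics:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def monitored(job_name):
    """Log outcome and duration of each invocation of the wrapped job."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                log.info("%s succeeded in %.2fs", job_name,
                         time.monotonic() - start)
                return result
            except Exception:
                log.error("%s failed after %.2fs", job_name,
                          time.monotonic() - start)
                raise
        return inner
    return wrap
```

Usage is a one-line annotation, e.g. `@monitored("transform_txns")` above a job function, so every job emits comparable metrics with no per-job boilerplate.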
Constraints
- Infrastructure: Limited to AWS services (EC2, S3, RDS).
- Budget: Monthly budget capped at $10K for cloud services.
- Compliance: Must adhere to financial regulations requiring data auditability and traceability.
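One rollback pattern compatible with these constraints is stage-then-promote: a job writes to a staging location and the output is promoted atomically only on success, so a failure leaves the previous load untouched and the promotion point gives a clean audit record. A minimal file-level sketch (mapping this to staging S3 prefixes in the stated AWS setup is an assumption):

```python
import os
import tempfile

def run_with_rollback(job, final_path):
    """Run `job`, writing its output to a temp file, then atomically promote.

    On failure the partial output is discarded and whatever was previously
    at final_path is left intact -- the whole file is the unit of rollback.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    try:
        with os.fdopen(fd, "w") as out:
            job(out)                 # job writes its records to `out`
    except Exception:
        os.unlink(tmp_path)          # rollback: discard partial output
        raise
    os.replace(tmp_path, final_path) # atomic commit on the same filesystem
```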