Context
DataCorp, a financial services provider, aggregates large datasets from various internal and external sources (transaction logs, market feeds, user activity). The current nightly batch ETL process suffers from data quality issues and reporting delays that impact business decisions. The goal is to design a robust ETL pipeline that handles 10TB of data daily while ensuring data integrity and quality.
Scale Requirements
- Daily Data Volume: 10TB from diverse sources
- Throughput: Process 1TB/hour during peak hours
- Latency: Data must be available for analysis within 2 hours of extraction
- Retention: Raw data stored for 90 days, aggregated data indefinitely
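As a quick sanity check on these targets, a back-of-the-envelope sketch (the 20-hour daily processing window is an assumption for illustration, not a stated requirement):

```python
# Rough capacity arithmetic for the stated scale targets.
DAILY_VOLUME_TB = 10
PEAK_RATE_TB_PER_HR = 1
PROCESSING_WINDOW_HRS = 20  # assumed window; leaves slack for reruns and backfills

sustained_rate = DAILY_VOLUME_TB / PROCESSING_WINDOW_HRS   # 0.5 TB/hour
hours_at_peak = DAILY_VOLUME_TB / PEAK_RATE_TB_PER_HR      # 10 hours to clear a day

print(f"Sustained rate needed: {sustained_rate:.2f} TB/hour")
print(f"Hours at peak rate to clear one day's volume: {hours_at_peak:.0f}")
```

At the 1TB/hour peak rate, a full day's volume clears in 10 hours, so the 2-hour latency target is driven by scheduling frequency (micro-batches) rather than raw throughput.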
Requirements
- Develop an ETL pipeline that extracts data from multiple sources (APIs, databases, flat files) and loads it into a Snowflake data warehouse (a load sketch follows this list).
- Implement data quality checks, including schema validation, deduplication, and anomaly detection, during the transformation phase (see the quality-check sketch below).
- Ensure the pipeline can handle incremental loads and backfill historical data as needed (see the watermark sketch below).
- Create an orchestration strategy using Apache Airflow to manage dependencies, scheduling, and monitoring (see the DAG skeleton below).
- Design a reporting mechanism to track data quality metrics and alert on failures or anomalies (the DAG skeleton's failure callback below illustrates the alerting hook).
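For the incremental-load and backfill requirement, a common pattern is watermark-based extraction: persist the high-water mark of the last successful run and pull only newer rows; resetting the watermark to the epoch effectively triggers a historical backfill. A minimal sketch, assuming the source is the existing RDS (PostgreSQL here) with a `transactions`-style table carrying an `updated_at` column and a hypothetical `etl_watermarks` state table:

```python
from datetime import datetime, timezone

import psycopg2  # assumes the RDS source is PostgreSQL


def load_watermark(conn, table: str) -> datetime:
    """Read the stored high-water mark; fall back to epoch for a full backfill."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT last_extracted_at FROM etl_watermarks WHERE source_table = %s",
            (table,),
        )
        row = cur.fetchone()
    return row[0] if row else datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_incremental(conn, table: str, watermark: datetime) -> list[tuple]:
    """Pull only rows modified since the last successful run."""
    with conn.cursor() as cur:
        # Table name comes from an internal allow-list, not user input.
        cur.execute(
            f"SELECT * FROM {table} WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        return cur.fetchall()
```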
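For the data-quality requirement, a sketch of the three checks in pandas; the column names (`txn_id`, `amount`, `ts`) and the z-score threshold are illustrative, not part of the brief:

```python
import pandas as pd

EXPECTED_SCHEMA = {"txn_id": "int64", "amount": "float64", "ts": "datetime64[ns]"}


def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations (missing columns or wrong dtypes)."""
    errors = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors


def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates on the business key, keeping the most recent record."""
    return df.sort_values("ts").drop_duplicates(subset=["txn_id"], keep="last")


def flag_anomalies(df: pd.DataFrame, z_threshold: float = 3.0) -> pd.Series:
    """Flag outlier amounts with a simple z-score test."""
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    return z.abs() > z_threshold
```

Rows that fail validation or are flagged as anomalous would typically be quarantined to a reject table rather than dropped, so the 90-day raw retention requirement still holds.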
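For the load step, Snowflake's usual bulk path is staged files on S3 plus `COPY INTO`. A sketch using the snowflake-connector-python package; the stage name (`s3_landing_stage`), target table, and connection parameters are placeholders:

```python
import os

import snowflake.connector


def load_to_snowflake(prefix: str) -> None:
    """Bulk-load staged S3 files for the given prefix into a raw table."""
    conn = snowflake.connector.connect(
        account="datacorp",  # placeholder connection parameters
        user="etl_service",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ETL_WH",
        database="RAW",
        schema="PUBLIC",
    )
    try:
        conn.cursor().execute(
            f"""
            COPY INTO raw.transactions
            FROM @s3_landing_stage/{prefix}
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
            ON_ERROR = 'ABORT_STATEMENT'
            """
        )
    finally:
        conn.close()
```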
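For orchestration and alerting, a skeleton Airflow DAG wiring the stages together; the task bodies are stubs, and `notify_on_failure` stands in for whatever alerting channel (Slack, PagerDuty, email) is actually used:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    """Stub alert hook; replace with a real Slack/PagerDuty/email integration."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed: {context.get('exception')}")


default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="datacorp_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # hourly micro-batches to meet the 2-hour latency target
    catchup=False,                # backfills are triggered explicitly instead
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    validate = PythonOperator(task_id="validate", python_callable=lambda: None)
    transform = PythonOperator(task_id="transform", python_callable=lambda: None)
    load = PythonOperator(task_id="load", python_callable=lambda: None)
    publish_metrics = PythonOperator(task_id="publish_metrics", python_callable=lambda: None)

    extract >> validate >> transform >> load >> publish_metrics
```

The final `publish_metrics` task is where quality metrics (row counts, rejection rates, anomaly counts) would be written to a reporting table, giving the alerting layer something concrete to threshold against.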
Constraints
- Infrastructure: Existing AWS setup with Snowflake, S3, and RDS
- Budget: Monthly cloud spend capped at $20K
- Compliance: Must adhere to financial regulations regarding data handling and retention