Context
DataCorp, a financial analytics company, operates ETL pipelines that process transactional data from various sources, including databases and APIs. As the team scales, discrepancies in configuration and data schemas across the development, staging, and production environments are leading to inconsistent data quality and processing failures. The VP of Data Engineering has mandated that the ETL pipeline be designed to handle these discrepancies effectively.
Scale Requirements
- Data Volume: 100M records daily across all environments.
- Latency: Near-real-time processing with an end-to-end target of < 10 seconds from ingestion to storage.
- Throughput: 5,000 records/second average, with peaks up to 20,000 records/second.
Requirements
- Implement environment-specific configuration using a centralized configuration management tool (e.g., HashiCorp Consul); a lookup sketch follows this list.
- Design the ETL pipeline to validate schemas and data types dynamically based on the active environment.
- Ensure the pipeline handles missing fields and type mismatches gracefully, logging discrepancies for review (see the validation sketch below).
- Use orchestration tools (e.g., Apache Airflow) to manage dependencies and automate deployment across environments (see the DAG sketch below).
- Include monitoring to track discrepancies and alert the team when thresholds are exceeded (e.g., > 5% of records failing validation; see the alerting sketch below).
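One way to satisfy the first requirement is to read per-environment settings from Consul's KV store at pipeline startup. The sketch below assumes keys of the form etl/<environment>/pipeline-config holding JSON values and uses the python-consul client; the key layout and field names are illustrative, not prescribed by the brief.

```python
"""Sketch: fetch environment-specific pipeline settings from Consul KV.

Assumptions (not specified in the brief): keys live under
etl/<environment>/..., values are JSON blobs, and the python-consul
client is available. Key names and fields are illustrative only.
"""
import json
import os

import consul  # pip install python-consul

ENVIRONMENT = os.environ.get("PIPELINE_ENV", "development")


def load_pipeline_config(env: str = ENVIRONMENT) -> dict:
    """Read the config blob for one environment from Consul's KV store."""
    client = consul.Consul()  # defaults to localhost:8500; point at the real agent in prod
    _, entry = client.kv.get(f"etl/{env}/pipeline-config")
    if entry is None:
        raise RuntimeError(f"No pipeline config found for environment '{env}'")
    return json.loads(entry["Value"].decode("utf-8"))


if __name__ == "__main__":
    config = load_pipeline_config()
    print(config.get("redshift_cluster"), config.get("schema_version"))
```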
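For the dynamic validation and graceful-degradation requirements, a per-environment schema lookup combined with tolerant type coercion is a minimal approach. The schemas, field names, and coercion rules below are hypothetical; in practice they would come from the centralized configuration rather than being hard-coded.

```python
"""Sketch: per-environment schema validation with tolerant error handling.

The schema dicts and field names are hypothetical placeholders; in a real
deployment they would be loaded from the Consul config shown above.
"""
import logging

logger = logging.getLogger("etl.validation")

# Hypothetical per-environment schemas: field -> (expected type, required?)
SCHEMAS = {
    "production": {
        "transaction_id": (str, True),
        "amount": (float, True),
        "currency": (str, True),
        "settled_at": (str, False),
    },
    "staging": {
        "transaction_id": (str, True),
        "amount": (float, True),
    },
    "development": {
        "transaction_id": (str, True),
        "amount": (float, False),
    },
}


def validate_record(record: dict, env: str) -> tuple[dict, list[str]]:
    """Return a cleaned record plus a list of discrepancy messages.

    Missing fields default to None, type mismatches are coerced when safe,
    and every discrepancy is logged, so a single bad field never drops the
    whole record.
    """
    schema = SCHEMAS[env]
    issues: list[str] = []
    cleaned: dict = {}
    for field, (expected_type, required) in schema.items():
        value = record.get(field)
        if value is None:
            if required:
                issues.append(f"missing required field '{field}'")
            cleaned[field] = None
            continue
        if isinstance(value, expected_type):
            cleaned[field] = value
        else:
            try:
                cleaned[field] = expected_type(value)  # attempt a safe coercion
                issues.append(f"coerced '{field}' from {type(value).__name__}")
            except (TypeError, ValueError):
                cleaned[field] = None
                issues.append(f"type mismatch on '{field}'")
    if issues:
        logger.warning("record %s: %s", record.get("transaction_id"), "; ".join(issues))
    return cleaned, issues
```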
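For orchestration, a minimal Airflow 2.x DAG can wire the extract, validate, and load steps and be deployed identically to each environment, with only the configuration differing. The dag_id, schedule, and task bodies below are placeholders, not part of the brief.

```python
"""Sketch: a minimal Airflow DAG wiring extract -> validate -> load.

Task callables, the dag_id, and the schedule are placeholders; the active
environment would be injected per deployment (e.g., via an Airflow Variable
or the Consul lookup sketched above).
"""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**_):
    ...  # pull transactional data from source databases / APIs


def validate(**_):
    ...  # apply the per-environment schema checks


def load(**_):
    ...  # write validated records to S3 / Redshift


with DAG(
    dag_id="transactions_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # placeholder cadence
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> validate_task >> load_task
```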
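The monitoring requirement can be met on the stated AWS stack by publishing the validation-failure rate as a CloudWatch metric and notifying the team when it breaches the 5% threshold. The namespace, metric name, and SNS topic ARN below are placeholders; only the threshold itself comes from the requirements.

```python
"""Sketch: push the validation-failure rate to CloudWatch and alert past 5%.

The namespace, metric name, and SNS topic ARN are placeholders; the 5%
threshold is taken from the requirements above.
"""
import boto3

FAILURE_THRESHOLD = 0.05  # > 5% of records failing validation triggers an alert

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-discrepancy-alerts"  # placeholder


def report_validation_stats(total: int, failed: int, env: str) -> None:
    """Publish this batch's failure rate and notify the team if it exceeds the threshold."""
    rate = failed / total if total else 0.0
    cloudwatch.put_metric_data(
        Namespace="DataCorp/ETL",
        MetricData=[{
            "MetricName": "ValidationFailureRate",
            "Dimensions": [{"Name": "Environment", "Value": env}],
            "Value": rate,
            "Unit": "None",
        }],
    )
    if rate > FAILURE_THRESHOLD:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject=f"[{env}] ETL validation failure rate {rate:.1%}",
            Message=f"{failed} of {total} records failed validation in {env}.",
        )
```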
Constraints
- Team: 3 data engineers, 1 DevOps engineer.
- Infrastructure: AWS-based (using S3, Redshift, and Lambda).
- Budget: Limited to $15K/month for cloud resources and tools.