Context
DataCorp, a financial analytics company, currently manages its ETL processes with traditional tools such as Apache Airflow but struggles to express complex workflows and embed data quality checks in them. The existing setup cannot keep up with the growing volume of data arriving from multiple APIs and databases, which delays data availability for analytics. The goal is to implement a more flexible orchestration framework using LangChain that streamlines ETL processes while enforcing data quality.
Scale Requirements
- Data Sources: 10+ APIs and databases, with a daily ingestion volume of ~5 TB.
- Processing Frequency: ETL jobs need to run every 15 minutes.
- Latency Target: Data should be available for querying within 10 minutes of extraction.
- Retention: Raw data stored for 30 days, transformed data indefinitely.
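Taken together, these figures imply roughly 96 runs per day and an average batch of about 5 TB / 96 ≈ 52 GB per 15-minute window, with the 10-minute latency target bounding the processing time available after each extraction.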
Requirements
- Use LangChain to orchestrate ETL workflows, integrating with the various data sources (orchestration sketch below).
- Implement data validation checks (schema validation, duplicate detection) during the extraction phase (validation sketch below).
- Transform raw data into analytics-ready formats (e.g., aggregations, joins) before loading (transformation sketch below).
- Store transformed data in a Snowflake data warehouse with appropriate data models (load sketch below).
- Set up monitoring and alerting for data quality metrics and job failures (monitoring sketch below).
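A minimal sketch of how the orchestration requirement could look, assuming each pipeline stage is a plain Python function composed into a LangChain runnable sequence (LCEL). The function bodies, the `https://api.example.com/trades` endpoint, and the column handling are hypothetical placeholders rather than DataCorp's actual sources.

```python
# Sketch: extract -> validate -> transform -> load as a LangChain runnable
# sequence. Each stage is an ordinary function wrapped in RunnableLambda,
# so a whole ETL run becomes a single invocable object per source.
import pandas as pd
import requests
from langchain_core.runnables import RunnableLambda


def extract(source_url: str) -> pd.DataFrame:
    """Pull raw records from one API source into a DataFrame."""
    response = requests.get(source_url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on empty extracts and drop exact duplicates (detailed below)."""
    if df.empty:
        raise ValueError("extraction returned no rows")
    return df.drop_duplicates()


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the aggregations/joins that make data analytics-ready."""
    return df


def load(df: pd.DataFrame) -> int:
    """Placeholder for the Snowflake load step; returns rows written."""
    return len(df)


# LCEL composition: the | operator pipes each stage's output into the next.
etl_chain = (
    RunnableLambda(extract)
    | RunnableLambda(validate)
    | RunnableLambda(transform)
    | RunnableLambda(load)
)

if __name__ == "__main__":
    rows = etl_chain.invoke("https://api.example.com/trades")  # hypothetical endpoint
    print(f"loaded {rows} rows")
```

Because each stage is an ordinary function that can be unit-tested in isolation, and chains compose with the same `|` operator, per-source pipelines can be assembled and triggered on the 15-minute schedule without the team needing much LangChain-specific machinery up front.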
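For the extraction-phase checks, the sketch below shows one way to enforce a column/dtype contract and reject batches with duplicate business keys using pandas. The `EXPECTED_SCHEMA`, the `trade_id`/`amount`/`ts` columns, and the key choice are hypothetical examples.

```python
# Sketch: schema validation against an expected column/dtype map plus
# duplicate detection on a business key, run during extraction.
import pandas as pd

EXPECTED_SCHEMA = {"trade_id": "int64", "amount": "float64", "ts": "datetime64[ns]"}
KEY_COLUMNS = ["trade_id"]


def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    # Schema check: every expected column must be present with the right dtype.
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"column {col!r} is {df[col].dtype}, expected {dtype}")

    # Duplicate check: reject the batch if the business key is not unique.
    dupes = int(df.duplicated(subset=KEY_COLUMNS).sum())
    if dupes:
        raise ValueError(f"{dupes} duplicate rows on key {KEY_COLUMNS}")
    return df
```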
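The transformation step itself need not involve LangChain at all; as an illustration, the sketch below rolls raw trade-level rows up into per-account hourly metrics with pandas. The grouping columns and metric names are assumptions for illustration only.

```python
# Sketch: aggregate raw records into an analytics-ready hourly grain.
import pandas as pd


def to_hourly_aggregates(raw: pd.DataFrame) -> pd.DataFrame:
    """Roll trade-level rows up to one row per account per hour."""
    raw = raw.assign(hour=raw["ts"].dt.floor("h"))
    return (
        raw.groupby(["account_id", "hour"], as_index=False)
           .agg(trade_count=("trade_id", "count"),
                total_amount=("amount", "sum"))
    )
```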
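For loading into Snowflake, one option within the existing stack is the Snowflake Python connector's `write_pandas` helper. The connection parameters, warehouse/database/schema names, and target table are placeholders; in practice credentials would come from a secrets manager rather than environment variables.

```python
# Sketch: load a transformed DataFrame into Snowflake via write_pandas.
import os

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas


def load_to_snowflake(df: pd.DataFrame, table: str = "TRADES_HOURLY") -> int:
    """Write the DataFrame to the target table and return rows written."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ETL_WH",      # hypothetical warehouse
        database="ANALYTICS",    # hypothetical database
        schema="PUBLIC",
    )
    try:
        # Assumes the target table already exists; newer connector versions
        # also accept auto_create_table=True.
        success, _, nrows, _ = write_pandas(conn, df, table_name=table)
        if not success:
            raise RuntimeError(f"write_pandas failed for table {table}")
        return nrows
    finally:
        conn.close()
```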
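Monitoring and alerting can stay inside the existing AWS footprint: each run publishes data quality metrics to CloudWatch, and alarms on those metrics (backed by SNS) notify the team. The namespace, metric names, and dimensions below are hypothetical.

```python
# Sketch: publish per-run data quality metrics to CloudWatch so alarms can
# alert on validation failures, duplicate spikes, or missing runs.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def publish_quality_metrics(source: str, rows_loaded: int, duplicates_found: int) -> None:
    cloudwatch.put_metric_data(
        Namespace="DataCorp/ETL",
        MetricData=[
            {
                "MetricName": "RowsLoaded",
                "Value": rows_loaded,
                "Unit": "Count",
                "Dimensions": [{"Name": "Source", "Value": source}],
            },
            {
                "MetricName": "DuplicatesFound",
                "Value": duplicates_found,
                "Unit": "Count",
                "Dimensions": [{"Name": "Source", "Value": source}],
            },
        ],
    )
```

An alarm such as "DuplicatesFound > 0" or "no RowsLoaded data point in 30 minutes" then covers both data quality regressions and silent job failures.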
Constraints
- Team: 3 data engineers with limited experience in LangChain.
- Infrastructure: AWS-based with existing Snowflake and S3.
- Budget: $10K/month for additional tools and services.