Context
TechGadget, a consumer electronics company, is launching a new product line and needs a robust ETL pipeline to consolidate data from multiple sources: sales forecasts, inventory levels, and customer feedback. The current process relies on manual data entry with no integration between sources, causing inconsistencies and reporting delays.
Scale Requirements
- Data Sources: 3 external REST APIs (sales forecasts, inventory, customer feedback) with a combined peak throughput of ~1,000 records/min.
- Data Volume: Anticipated 1 million records per month post-launch.
- Latency Target: Data must be available for reporting within 10 minutes of extraction.
- Storage: PostgreSQL database with at least 500 GB of capacity.
Requirements
- Implement an ETL pipeline that extracts data from the three REST APIs and transforms it into a consistent schema.
- Perform data quality checks, including validation of data formats and deduplication.
- Load the transformed data into a PostgreSQL database.
- Schedule the pipeline to run every 5 minutes using Apache Airflow.
- Provide a monitoring dashboard to track ETL job success rates and data quality metrics.
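The transform and load requirements above can be sketched as plain Python functions that an Airflow PythonOperator would call on the 5-minute schedule. This is a minimal sketch, not a definitive implementation: the field names, validation rules, and `metrics` table are illustrative assumptions, and `sqlite3` stands in here for the on-premises PostgreSQL connection, since both follow the Python DB-API and support `INSERT ... ON CONFLICT` upserts.

```python
import sqlite3
from datetime import datetime

# Assumed record shape; real field names depend on the three source APIs.
REQUIRED_FIELDS = {"record_id", "source", "timestamp", "value"}

def validate(record: dict) -> bool:
    """Format check: required fields present and timestamp parses as ISO 8601."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    try:
        datetime.fromisoformat(str(record["timestamp"]))
    except ValueError:
        return False
    return True

def deduplicate(records: list) -> list:
    """Keep the first occurrence of each record_id, preserving input order."""
    seen, out = set(), []
    for r in records:
        if r["record_id"] not in seen:
            seen.add(r["record_id"])
            out.append(r)
    return out

def transform(raw: list) -> list:
    """Drop records that fail validation, then deduplicate."""
    return deduplicate([r for r in raw if validate(r)])

def load(conn, records: list) -> None:
    """Idempotent upsert, so a re-run of a 5-minute batch cannot double-load."""
    conn.executemany(
        "INSERT INTO metrics (record_id, source, timestamp, value) "
        "VALUES (:record_id, :source, :timestamp, :value) "
        "ON CONFLICT (record_id) DO UPDATE SET value = excluded.value",
        records,
    )
    conn.commit()
```

In the real pipeline the `load` connection would come from psycopg2 against the on-premises PostgreSQL server; the upsert statement is the same, and making the load idempotent is what keeps retried Airflow runs safe.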
Constraints
- Team: 2 data engineers with experience in Python and SQL.
- Infrastructure: On-premises PostgreSQL server, limited cloud budget.
- Compliance: Ensure data handling adheres to GDPR regulations regarding customer data.
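One common tactic for the GDPR constraint is to pseudonymize direct identifiers in the customer-feedback records before they reach the warehouse. A minimal sketch, assuming an `email` field and a secret key held outside the database (both the field name and the keyed-hash scheme are assumptions, not part of the spec):

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): the same customer always maps to the same
    token, but the mapping cannot be reversed without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub_feedback(record: dict, secret_key: bytes) -> dict:
    """Replace the raw email with its token before the load step.
    Other direct identifiers (names, phone numbers) would be handled the same way."""
    out = dict(record)
    if "email" in out:
        out["email"] = pseudonymize(out["email"], secret_key)
    return out
```

Because the token is stable across runs, it can still serve as a join key in the 5-minute batches, while raw identifiers never land in PostgreSQL; the secret key itself must be stored and rotated outside the database.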