Context
RetailCorp, a leading retail analytics provider, processes large volumes of transaction data daily from sources including POS systems, online sales, and customer interactions. The company aims to enhance its ETL pipeline to ensure robust data governance, maintain data quality, and comply with regulations such as GDPR and CCPA. The current batch-processing approach, built on Apache NiFi and AWS Glue, lacks real-time data quality checks and governance features, leading to data inconsistencies and compliance risks.
Scale Requirements
- Data Volume: 10TB of transaction data daily, peaking at up to 50TB during sales events.
- Latency: ETL jobs must complete within 1 hour, keeping data near real-time for analytics.
- Data Sources: Integrate 50+ sources spanning structured (SQL databases) and semi-structured (JSON, XML) formats.
Requirements
- Design an ETL pipeline that incorporates data quality checks (e.g., schema validation, anomaly detection) before loading into the data warehouse.
- Implement data lineage tracking to monitor data transformations and ensure compliance with data governance policies.
- Utilize orchestration tools (e.g., Apache Airflow) to manage dependencies and scheduling of ETL jobs with retries on failure.
- Ensure data privacy compliance by implementing data masking and encryption for sensitive information during processing.
- Provide monitoring and alerting for data quality metrics, including missing values and duplicate records.
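The first requirement above, a data quality gate before warehouse load, can be sketched as below. This is a minimal illustration, not RetailCorp's implementation: the schema, field names, and the amount threshold are all assumptions, and a production pipeline would more likely use a library such as Great Expectations or AWS Glue Data Quality.

```python
# Minimal sketch of a pre-load quality gate. EXPECTED_SCHEMA, the field
# names, and max_amount are illustrative assumptions.

EXPECTED_SCHEMA = {"txn_id": str, "store_id": str, "amount": float}

def validate_schema(record):
    """Return a list of schema violations for one transaction record."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def detect_amount_anomalies(records, max_amount=10_000.0):
    """Flag transactions whose amount falls outside a plausible range."""
    return [r for r in records
            if isinstance(r.get("amount"), float)
            and not (0.0 < r["amount"] <= max_amount)]

batch = [
    {"txn_id": "t1", "store_id": "s1", "amount": 19.99},
    {"txn_id": "t2", "store_id": "s1", "amount": 250000.0},  # out of range
    {"txn_id": "t3", "amount": 5.0},                         # missing store_id
]

bad_schema = [r for r in batch if validate_schema(r)]
anomalies = detect_amount_anomalies(batch)
```

Records failing either check would be quarantined (e.g. to an S3 error prefix) rather than loaded into Redshift.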
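Lineage tracking (the second requirement) can be reduced to its core idea: every transformation records what it read, what it did, and what it wrote. The decorator and the S3 paths below are a toy illustration; a real deployment would publish these events to a catalog such as the AWS Glue Data Catalog or an OpenLineage backend.

```python
# Toy lineage log: each tracked transformation appends an audit entry.
# The operation names and S3 paths are illustrative assumptions.
import datetime

lineage = []

def tracked(op_name, source, target):
    """Decorator that logs a lineage entry each time the step runs."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            lineage.append({
                "op": op_name,
                "source": source,
                "target": target,
                "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            })
            return out
        return wrapper
    return decorator

@tracked("dedupe", "s3://raw/transactions", "s3://clean/transactions")
def dedupe(rows):
    # Keep one row per transaction id.
    return list({r["txn_id"]: r for r in rows}.values())

clean = dedupe([{"txn_id": "t1"}, {"txn_id": "t1"}])
```

The resulting log doubles as the audit trail the Constraints section calls for.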
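For orchestration with retries, Airflow provides this natively via the `retries` and `retry_delay` task parameters; the plain-Python sketch below only illustrates the retry semantics. The `flaky_extract` job is a hypothetical stand-in for triggering a Glue job.

```python
# Retry-on-failure semantics, as Airflow applies to each task.
# flaky_extract is a hypothetical stand-in for a Glue job trigger.
import time

def run_with_retries(job, max_retries=3, retry_delay=0.0):
    """Run `job`, retrying up to `max_retries` times on any exception."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            if attempts > max_retries:
                raise
            time.sleep(retry_delay)

calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source failure")
    return "extracted"

result, attempts = run_with_retries(flaky_extract, max_retries=3)
```

In Airflow itself this collapses to declaring `retries=3` on the task; dependencies between extract, validate, and load steps are then expressed in the DAG.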
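The masking requirement can be sketched with salted-hash pseudonymization, one common GDPR-friendly approach. The PII field names and the inline salt are assumptions for illustration; in production the salt would come from a secrets manager (e.g. AWS Secrets Manager), and column-level encryption would complement this in transit and at rest.

```python
# Salted SHA-256 pseudonymization of PII fields. PII_FIELDS and SALT
# are illustrative; a real salt must be a managed secret.
import hashlib

PII_FIELDS = {"email", "customer_name"}
SALT = b"replace-with-managed-secret"

def mask_value(value):
    """Deterministically pseudonymize one PII value."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

def mask_record(record):
    return {k: mask_value(v) if k in PII_FIELDS else v
            for k, v in record.items()}

record = {"txn_id": "t1", "email": "a@example.com", "amount": 19.99}
masked = mask_record(record)
```

Deterministic hashing keeps join keys usable for analytics while removing the raw identifier from the warehouse.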
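Finally, the two quality metrics named in the last requirement (missing values and duplicate records) can be computed per batch and pushed to an alerting system such as CloudWatch. The key field and required columns below are assumptions.

```python
# Per-batch quality metrics: missing required values and duplicate
# records by key. Field names are illustrative assumptions.
from collections import Counter

def quality_metrics(records, key_field="txn_id", required=("txn_id", "amount")):
    missing = sum(1 for r in records
                  for f in required if r.get(f) is None)
    keys = [r.get(key_field) for r in records]
    duplicates = sum(c - 1 for c in Counter(keys).values() if c > 1)
    return {"missing_values": missing, "duplicate_records": duplicates}

batch = [
    {"txn_id": "t1", "amount": 10.0},
    {"txn_id": "t1", "amount": 10.0},   # duplicate key
    {"txn_id": "t2", "amount": None},   # missing amount
]
metrics = quality_metrics(batch)
# Alert when a metric exceeds its threshold, e.g. via a CloudWatch alarm.
```

Emitting these counts on every run gives the monitoring dashboard a time series to alert on, rather than discovering bad loads downstream.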
Constraints
- Team: 5 data engineers, with expertise in AWS and Python.
- Infrastructure: AWS-based environment (S3, Redshift, Glue).
- Budget: $20K/month for cloud services.
- Compliance: Must adhere to GDPR and CCPA regulations, requiring data anonymization and audit trails.