Context
DataCorp, a financial analytics firm, has developed multiple machine learning models to predict customer behavior and detect fraud. Deployment of these models is currently manual and not integrated with the existing ETL pipeline, leading to data quality issues and delayed insights. The goal is to automate model deployment within the ETL framework to provide real-time predictions and ensure data integrity.
Scale Requirements
- Throughput: Process 500,000 records per hour (approximately 140 records per second).
- Latency: Predictions should be available within 2 minutes of data ingestion.
- Data Volume: Handle daily ingestion of 10 million records, with a storage requirement of 1TB per day.
- Retention: Store raw data for 30 days and predictions for 90 days.
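As a quick sanity check on the scale figures above, the stated throughput, per-record size, and retention footprint can be derived directly (a rough back-of-envelope sketch; assumes a steady ingestion rate and decimal units for 1TB):

```python
# Back-of-envelope check of the stated scale requirements.
RECORDS_PER_HOUR = 500_000
records_per_second = RECORDS_PER_HOUR / 3600  # ~139, i.e. roughly 140 rec/s

DAILY_RECORDS = 10_000_000
DAILY_STORAGE_BYTES = 1_000_000_000_000  # 1TB/day, decimal units
bytes_per_record = DAILY_STORAGE_BYTES / DAILY_RECORDS  # 100,000 bytes ≈ 100 KB

# Retention footprint: 30 days of raw data at 1TB/day.
raw_retention_tb = 30 * 1

print(f"{records_per_second:.0f} rec/s, "
      f"{bytes_per_record / 1000:.0f} KB/record, "
      f"{raw_retention_tb} TB raw data retained")
```

The ~100 KB/record figure is worth noting: it suggests wide or nested records, which affects the choice between row-oriented ingestion and columnar storage in Redshift.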
Requirements
- Design an ETL pipeline that integrates the deployment of ML models for real-time predictions.
- Implement data validation and quality checks at each stage of the pipeline.
- Ensure that the pipeline can handle schema evolution and model updates without interrupting processing.
- Provide monitoring and alerting for data quality issues and model performance metrics.
- Create a rollback mechanism for model deployments in case of failures.
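The per-stage validation requirement can start as a lightweight record-level check before data reaches the model or Redshift. A minimal sketch (field names like `customer_id` and `amount` are illustrative, not from the source):

```python
# Minimal per-record data-quality check, run at each pipeline stage.
# Field names below are illustrative placeholders.
REQUIRED_FIELDS = {"customer_id", "event_time", "amount"}

def validate_record(record: dict) -> list[str]:
    """Return a list of data-quality errors for one ingested record.

    An empty list means the record passed all checks.
    """
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        errors.append("amount is not numeric")
    return errors

# Usage: route failing records to a quarantine location instead of dropping them.
bad = validate_record({"customer_id": 42})
print(bad)  # e.g. ["missing fields: ['amount', 'event_time']"]
```

Aggregating these error lists per batch also feeds the monitoring requirement: an error rate above a threshold can trigger an alert before bad data reaches the model.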
Constraints
- Team: 3 data engineers with expertise in Python and SQL, but limited experience with ML deployment.
- Infrastructure: AWS-based, using S3, Lambda, and Redshift.
- Budget: Limited to $15,000/month for additional compute resources.