Context
DataCorp, a leading analytics firm, processes large volumes of data daily from sources including transactional databases and third-party APIs. Its existing ETL pipeline, built on Apache Airflow and Apache Spark, has struggled with data quality and operational visibility, leading to delayed insights and higher operational costs.
To improve performance and reliability, the Data Engineering team plans to implement comprehensive monitoring with Splunk and Dynatrace, tracking pipeline health, data quality, and performance metrics.
Scale Requirements
- Data Volume: 10TB processed daily, with peak loads during month-end reporting.
- Latency: ETL jobs must complete within 2 hours of data ingestion.
- Data Quality: 98% accuracy required for all processed datasets, with real-time alerts on data anomalies.
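The 98% accuracy target with real-time anomaly alerts could be enforced as a threshold check after each processed batch. A minimal sketch in plain Python, where the record format and the `alert` hook are assumptions (in practice the alert would route through Splunk or the team's paging system):

```python
ACCURACY_THRESHOLD = 0.98  # from the scale requirements

def batch_accuracy(records):
    """Fraction of records in a batch that passed validation."""
    if not records:
        return 0.0
    valid = sum(1 for r in records if r.get("valid"))
    return valid / len(records)

def check_batch(records, alert=print):
    """Fire an alert when a batch falls below the required accuracy."""
    acc = batch_accuracy(records)
    if acc < ACCURACY_THRESHOLD:
        alert(f"Data quality alert: accuracy {acc:.2%} below 98% target")
    return acc
```

Running this at the end of each transformation stage keeps the check close to where the data is produced, so an alert identifies the failing stage directly.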
Requirements
- Integrate Splunk for log aggregation and real-time monitoring of ETL job executions.
- Utilize Dynatrace for application performance monitoring, focusing on Spark job metrics and Airflow DAG executions.
- Implement data validation checks at each transformation stage, ensuring schema compliance and accuracy.
- Set up alerting mechanisms for job failures, data quality issues, and performance degradation.
- Create dashboards for visualizing key metrics and trends over time.
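For the Splunk integration, ETL job events typically reach Splunk through the HTTP Event Collector (HEC). A sketch that builds the HEC payload for an Airflow task event; the index name and endpoint are assumptions, and the actual POST (shown in the comment) would use the team's HEC token:

```python
import json
import time

def build_hec_event(dag_id, task_id, status, index="etl_monitoring"):
    """Build a Splunk HTTP Event Collector payload for an ETL task event.

    The index name is illustrative; sensitive fields must be kept out of
    the event body per the data governance constraint.
    """
    return {
        "time": time.time(),
        "index": index,
        "sourcetype": "airflow:task",
        "event": {"dag_id": dag_id, "task_id": task_id, "status": status},
    }

# In production the payload would be POSTed to the HEC endpoint, e.g.:
#   requests.post("https://splunk.example.com:8088/services/collector/event",
#                 headers={"Authorization": "Splunk <HEC_TOKEN>"},
#                 data=json.dumps(payload))
```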
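The per-stage validation requirement could start as a lightweight schema assertion run between transformations. A sketch assuming a simple column-name-to-type schema (the field names are illustrative; a real pipeline would more likely lean on Spark's own schema enforcement or a validation library such as Great Expectations):

```python
def validate_schema(rows, schema):
    """Return (row index, column) pairs that violate the expected schema.

    schema: mapping of column name -> expected Python type.
    """
    failures = []
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row or not isinstance(row[col], typ):
                failures.append((i, col))
    return failures

# Illustrative stage schema; a non-empty result should fail the Airflow task.
ORDER_SCHEMA = {"order_id": int, "amount": float}
```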
Constraints
- Team: 3 data engineers with limited experience in monitoring tools.
- Budget: Monthly budget for monitoring tools capped at $5K.
- Compliance: Must comply with data governance policies and ensure sensitive data is not logged or exposed.
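The compliance constraint, keeping sensitive data out of logs, can be enforced centrally with a logging filter applied before records are shipped to Splunk. A minimal sketch using only the standard library; the set of field names treated as sensitive is an assumption and would come from the governance policy:

```python
import logging
import re

# Illustrative patterns considered sensitive under the governance policy.
SENSITIVE = re.compile(r"(ssn|email|card_number)=\S+", re.IGNORECASE)

def _mask(match):
    """Keep the key, replace the value with a mask."""
    return match.group(0).split("=")[0] + "=***"

class RedactingFilter(logging.Filter):
    """Mask sensitive key=value pairs before the record is emitted."""
    def filter(self, record):
        record.msg = SENSITIVE.sub(_mask, str(record.msg))
        return True  # never drop the record, only redact it

logger = logging.getLogger("etl")
logger.addFilter(RedactingFilter())
```

Attaching the filter to the root ETL logger means every handler, including the one forwarding to Splunk, sees only redacted messages.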