Context
DataCorp, a data analytics company, runs a Kubernetes-based architecture to orchestrate ETL jobs that process large datasets from various sources (e.g., MySQL, MongoDB, and S3). Recently, one of the ETL pods has been stuck in a CrashLoopBackOff state, delaying data processing and impacting downstream analytics.
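A typical first pass at diagnosing a CrashLoopBackOff looks like the following command sequence; the pod name `etl-worker-0` and namespace `etl` are placeholders for this scenario, not values given in the brief.

```shell
# Placeholder names: substitute your actual pod name and namespace.
kubectl get pods -n etl                          # confirm which pod is in CrashLoopBackOff
kubectl describe pod etl-worker-0 -n etl         # check Events: OOMKilled, failed probes, image pull errors
kubectl logs etl-worker-0 -n etl --previous      # logs from the last crashed container, not the fresh restart
kubectl get events -n etl --sort-by=.lastTimestamp
```

`kubectl logs --previous` is the key step: the current container may have just restarted and show nothing useful, while the previous container's output usually contains the actual failure.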
Scale Requirements
- Pods: 10 ETL pods running concurrently.
- Throughput: Each pod must sustain 500 records/second.
- Data Size: Average record size is 2KB, so each pod processes roughly 3.6GB per hour (about 36GB/hour across all 10 pods).
- Latency: Each job should complete within 1 hour.
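The per-pod data volume implied by the throughput targets can be verified in a few lines (using decimal units, 1GB = 10^6 KB):

```python
# Sanity-check the scale figures: records/second * record size * seconds/hour.
RECORDS_PER_SECOND = 500
RECORD_SIZE_KB = 2
SECONDS_PER_HOUR = 3600
PODS = 10

gb_per_pod_per_hour = RECORDS_PER_SECOND * RECORD_SIZE_KB * SECONDS_PER_HOUR / 1_000_000
fleet_gb_per_hour = gb_per_pod_per_hour * PODS

print(gb_per_pod_per_hour)  # 3.6
print(fleet_gb_per_hour)    # 36.0
```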
Requirements
- Identify the root cause of the CrashLoopBackOff state for the affected pod.
- Review logs and metrics to determine the failure points and patterns.
- Implement liveness and readiness probes so Kubernetes can detect unhealthy pods early and avoid routing work to them.
- Ensure that the pod can recover gracefully from transient errors without affecting overall ETL pipeline performance.
- Document the debugging process and solutions implemented for future reference.
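One way to satisfy the probe requirement is a manifest fragment like the sketch below. The container name, image, port, endpoint paths (`/healthz`, `/ready`), and resource figures are all assumptions for illustration, not values from this scenario; the resource requests (2 CPU / 4Gi per pod) are chosen to fit 10 pods within the 32-core / 64GB cluster constraint.

```yaml
containers:
  - name: etl-worker            # hypothetical container name
    image: datacorp/etl:latest  # placeholder image
    ports:
      - containerPort: 8080
    livenessProbe:              # restart the container if the process wedges
      httpGet:
        path: /healthz          # assumed health endpoint
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:             # stop routing work to a pod that is not ready
      httpGet:
        path: /ready            # assumed readiness endpoint
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    resources:                  # explicit limits make OOMKilled restarts visible in events
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "3"
        memory: "6Gi"
```

Setting resource limits explicitly also helps diagnosis: if the crash loop is memory-driven, `kubectl describe pod` will report `OOMKilled` rather than an opaque restart.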
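For the graceful-recovery requirement, transient source errors (a brief MySQL or S3 outage) should be retried inside the job rather than crashing the container, reserving pod restarts for persistent failures. A minimal sketch, assuming Python ETL workers; the function names and the choice of which exceptions count as transient are illustrative:

```python
import random
import time


def retry_transient(fn, attempts=5, base_delay=0.5,
                    transient=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff and jitter.

    Non-transient exceptions propagate immediately; after the final attempt the
    transient error propagates too, letting Kubernetes restart the pod.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with up to 10% jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            time.sleep(delay)


# Usage: a hypothetical flaky fetch that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient source outage")
    return "batch-ok"

result = retry_transient(flaky_fetch, base_delay=0.01)
print(result)  # batch-ok
```

The key design point is distinguishing transient from fatal errors: retrying a schema mismatch or bad credentials only hides the root cause, while crashing on a 2-second network blip wastes a full pod restart cycle.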
Constraints
- Infrastructure: Limited to existing Kubernetes cluster resources (32 CPU cores, 64GB RAM total).
- Budget: Only minimal additional spend on troubleshooting tools is allowed.
- Compliance: Must adhere to data governance policies regarding data handling and processing failures.