Context
NovaRetail runs its batch ETL platform on Kubernetes and wants junior data engineers to understand the core cluster components before they operate Airflow, Spark, and dbt workloads in production. The current issue is frequent confusion during incident response: engineers can deploy pipelines, but cannot explain how the Kubernetes control plane, worker nodes, Kubelet, and etcd affect scheduling, job execution, and recovery.
You are asked to explain, and lightly design, the Kubernetes architecture that supports a containerized data platform.
Scale Requirements
- Cluster size: 1 control plane, 12 worker nodes
- Workloads: 800 Airflow task pods/day, 120 Spark driver/executor pods/day, 40 dbt job pods/day (a quick throughput check follows this list)
- Latency target: New batch job pod scheduled within 30 seconds under normal load
- Storage: etcd database kept well under etcd's recommended ~8 GB limit; pipeline logs retained 30 days in object storage
- Availability: 99.9% scheduler/API availability during business hours
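Before getting into the requirements, a back-of-the-envelope check puts these numbers in perspective. This is a sketch only; the 3x peak factor is an assumption for illustration, not part of the spec.

```python
# Back-of-the-envelope check: pod churn vs. the 30-second scheduling target.

AIRFLOW_PODS_PER_DAY = 800
SPARK_PODS_PER_DAY = 120
DBT_PODS_PER_DAY = 40
PEAK_FACTOR = 3  # assumed: batch windows concentrate load; not from the spec

total_per_day = AIRFLOW_PODS_PER_DAY + SPARK_PODS_PER_DAY + DBT_PODS_PER_DAY
avg_per_minute = total_per_day / (24 * 60)
peak_per_minute = avg_per_minute * PEAK_FACTOR

print(f"total pods/day:   {total_per_day}")        # 960
print(f"average pods/min: {avg_per_minute:.2f}")   # ~0.67
print(f"assumed peak/min: {peak_per_minute:.2f}")  # ~2.00
```

Even at the assumed peak, roughly two pod creations per minute is far below what a single default kube-scheduler can handle, so meeting the 30-second target is mostly a question of node capacity and image pull time, not scheduler throughput.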
Requirements
- Describe the role of the control plane in managing cluster state, scheduling, and API access for data workloads.
- Explain what worker nodes do and how they run ETL containers such as Airflow workers or Spark executors.
- Explain the Kubelet's lifecycle responsibilities on each node, including node registration with the API server, pod health and status reporting, and container restart behavior.
- Describe how etcd stores cluster state and why it is critical for orchestration reliability.
- Walk through what happens when an Airflow DAG launches a KubernetesPodOperator task, from API request to running pod (a minimal DAG sketch follows this list).
- Identify at least three failure scenarios and how they affect pipeline execution and recovery.
- Propose basic monitoring and operational checks for these components (an example health check follows this list).
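To ground the DAG-to-pod walkthrough, here is a minimal sketch of an Airflow DAG that launches one KubernetesPodOperator task. The DAG id, namespace, and image are hypothetical, and the import path and `schedule` argument vary with the Airflow core and cncf.kubernetes provider versions installed.

```python
from datetime import datetime

from airflow import DAG
# Import path used by recent cncf.kubernetes provider releases; older
# releases expose the operator under operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="orders_nightly_etl",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # "schedule_interval" on Airflow < 2.4
    catchup=False,
) as dag:
    extract_orders = KubernetesPodOperator(
        task_id="extract_orders",
        name="extract-orders",
        namespace="etl",                      # hypothetical namespace
        image="novaretail/etl-extract:1.4",   # hypothetical image
        cmds=["python", "extract.py"],
        get_logs=True,  # stream container logs back into the Airflow UI
    )
```

When this task fires, the Airflow worker submits a Pod manifest to the API server, the API server persists the object in etcd, the scheduler binds the pod to a worker node, and that node's Kubelet pulls the image and starts the container.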
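For the monitoring and failure-scenario requirements, the sketch below uses the official `kubernetes` Python client, assuming kubeconfig access to the EKS cluster; the 30-second threshold comes from the latency target above. Nodes stuck NotReady usually mean the Kubelet has stopped heartbeating, while pods stuck Pending point at scheduler backlog or exhausted node capacity.

```python
from datetime import datetime, timezone

from kubernetes import client, config

PENDING_THRESHOLD_S = 30  # mirrors the scheduling latency target above

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Check 1: nodes whose Kubelet has stopped reporting Ready.
for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            print(f"ALERT: node {node.metadata.name} is NotReady "
                  f"(Kubelet heartbeat or node failure)")

# Check 2: pods stuck Pending past the latency target, which suggests
# scheduler backlog, quota limits, or no node with free capacity.
now = datetime.now(timezone.utc)
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    age_s = (now - pod.metadata.creation_timestamp).total_seconds()
    if age_s > PENDING_THRESHOLD_S:
        print(f"ALERT: pod {pod.metadata.namespace}/{pod.metadata.name} "
              f"Pending for {age_s:.0f}s")
```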
Constraints
- AWS-based environment using Amazon EKS
- Small platform team: 3 data engineers, 1 DevOps engineer
- No custom Kubernetes operators beyond standard Airflow and Spark-on-K8s deployment patterns
- Must support auditability for job execution and infrastructure changes (a log-export sketch follows)
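To make the auditability constraint concrete: EKS can export control-plane logs, including the API-server audit stream that records every job pod launch and infrastructure change made through the API, to CloudWatch Logs. A minimal boto3 sketch, assuming a hypothetical cluster name and region:

```python
import boto3

# Turn on EKS control-plane log export to CloudWatch Logs. The "audit"
# stream records every API-server request, covering both ETL pod launches
# and infrastructure changes made through the Kubernetes API.
eks = boto3.client("eks", region_name="us-east-1")  # region is a placeholder

eks.update_cluster_config(
    name="novaretail-etl",  # hypothetical cluster name
    logging={
        "clusterLogging": [
            {"types": ["api", "audit", "authenticator"], "enabled": True}
        ]
    },
)
```

Pairing these control-plane logs with the Airflow task logs already retained in object storage gives the small platform team an audit trail for both job execution and cluster changes without any custom operators.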