Context
NovaRetail runs Airflow, Spark batch jobs, and a small fleet of Kafka consumers on Amazon EKS. The platform team currently deploys every pipeline component as a generic Kubernetes Deployment, which has caused three recurring problems: stateful workers lack stable identities, restarts trigger duplicate processing, and node-level log collection is inconsistent across the fleet.
You need to explain when to use Deployments, StatefulSets, and DaemonSets in this data platform, and propose how each should be applied to pipeline workloads.
Scale Requirements
- Cluster size: 60 EKS worker nodes across 3 AZs
- Airflow workloads: 2 schedulers, 20-200 ephemeral task pods/day
- Streaming consumers: 48 Kafka partitions, target consumer lag < 30 seconds
- Stateful services: 3 metadata/cache replicas with persistent volumes
- Node-level agents: 1 pod per node for logs and metrics
- Availability target: 99.9% for orchestration and ingestion services
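To make the streaming targets concrete: with 48 partitions, a consumer group can run at most 48 consumers, and the replica count otherwise follows from throughput. A minimal sizing sketch is below; the ingest rate and per-consumer throughput figures are illustrative assumptions, not measured values from NovaRetail's cluster.

```python
import math

def consumers_needed(partitions: int,
                     ingest_rate: float,
                     per_consumer_rate: float) -> int:
    """Minimum consumer replicas to keep up with ingest_rate (msgs/s),
    capped at the partition count, since Kafka assigns at most one
    consumer per partition within a consumer group."""
    if per_consumer_rate <= 0:
        raise ValueError("per-consumer throughput must be positive")
    needed = math.ceil(ingest_rate / per_consumer_rate)
    return min(max(needed, 1), partitions)

# Illustrative numbers: 12,000 msgs/s total, 500 msgs/s per consumer.
print(consumers_needed(48, 12_000, 500))  # -> 24 replicas for 48 partitions
```

If the computed figure hits the partition cap, lag can only be reduced by raising per-consumer throughput or repartitioning the topic, not by adding replicas.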
Requirements
- Describe the operational differences between Deployments, StatefulSets, and DaemonSets in Kubernetes.
- Map each controller to concrete data engineering workloads such as Airflow webserver/scheduler, Kafka consumers, metadata databases, and node-level observability agents.
- Explain implications for scaling, pod identity, storage, rolling updates, and failure recovery.
- Show how you would deploy at least one stateless service and one node-level agent on EKS.
- Define monitoring and alerting for rollout failures, pod churn, unavailable replicas, and node coverage.
- Discuss trade-offs if the team wants to minimize operational complexity while preserving reliability for ETL and streaming jobs.
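For the "show how you would deploy" requirement, one answer shape is a pair of manifests: a Deployment for a stateless service and a DaemonSet for a node-level agent. The sketch below builds both as plain dicts; the names, images, and replica counts are assumptions for illustration, and `kubectl apply -f` accepts the JSON output directly.

```python
import json

def deployment(name: str, image: str, replicas: int) -> dict:
    """Stateless service: interchangeable pods, zero-downtime rolling updates."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            # Surge one pod at a time so the service never drops below capacity.
            "strategy": {"type": "RollingUpdate",
                         "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

def daemonset(name: str, image: str) -> dict:
    """Node-level agent: exactly one pod per schedulable node, no replica count."""
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    # Tolerate all taints so logs are collected from every node.
                    "tolerations": [{"operator": "Exists"}],
                },
            },
        },
    }

# Hypothetical workloads: Airflow webserver (stateless) and a log agent.
web = deployment("airflow-webserver", "apache/airflow:2.9.2", replicas=2)
agent = daemonset("fluent-bit", "fluent/fluent-bit:3.0")
print(json.dumps(web, indent=2))
```

Note the structural difference: the DaemonSet spec has no `replicas` field at all, because the scheduler derives the pod count from the node count, which is exactly why it fits the one-pod-per-node log-collection requirement.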
Constraints
- AWS-first stack: EKS, EBS, CloudWatch, Prometheus, Grafana
- Small platform team: 3 engineers supporting all data infrastructure
- Budget-sensitive: avoid overprovisioning dedicated nodes
- Compliance: production logs must be collected from every node and retained for 30 days
- Existing workloads cannot tolerate more than 5 minutes of orchestration downtime
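Two of these constraints reduce to checkable conditions: the compliance rule means the log DaemonSet must cover every node, and the downtime budget means an orchestration rollout with unavailable replicas may not persist past 5 minutes. A sketch of both checks as pure functions over metrics the team would already scrape (the metric semantics mirror kube-state-metrics fields, but that mapping is an assumption to verify against the installed exporter):

```python
def daemonset_coverage_ok(desired: int, ready: int) -> bool:
    """Compliance check: the log agent must be ready on every node
    the DaemonSet schedules onto (desired = node count it targets)."""
    return desired > 0 and ready >= desired

def rollout_stalled(unavailable_replicas: int,
                    minutes_unavailable: float,
                    budget_minutes: float = 5.0) -> bool:
    """Downtime-budget check: orchestration may not have unavailable
    replicas for longer than the 5-minute budget."""
    return unavailable_replicas > 0 and minutes_unavailable > budget_minutes

# 60-node cluster with 58 ready agent pods: coverage gap, page the on-call.
print(daemonset_coverage_ok(desired=60, ready=58))
# A scheduler replica unavailable for 7 minutes: budget exceeded.
print(rollout_stalled(unavailable_replicas=1, minutes_unavailable=7.0))
```

In practice these would live as Prometheus alert rules rather than application code, but expressing them as functions keeps the thresholds reviewable and unit-testable by a three-engineer team.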