Context
FinPulse, a mid-sized fintech company, runs nightly ETL pipelines on long-lived EC2 instances using Python, Airflow, and PostgreSQL. Deployments are inconsistent across environments, dependency conflicts frequently break jobs, and the platform team wants to standardize execution on Docker and Kubernetes while improving reliability and observability.
You are asked to design a containerized data platform for batch ETL and light streaming workloads. The new system must support reproducible builds, isolated runtime environments, and controlled rollouts for pipeline code.
Scale Requirements
- Batch jobs: 1,200 Airflow task runs/day across 80 DAGs
- Streaming jobs: 15 low-latency consumers processing ~25K events/sec total
- Data volume: 6 TB/day ingested from application databases, S3 drops, and Kafka topics
- Latency targets: batch SLA < 45 minutes per critical DAG; streaming freshness < 2 minutes
- Retention: raw data 180 days in object storage; curated warehouse tables retained indefinitely
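For rough sizing, the bullets above imply the back-of-envelope figures below. This is plain arithmetic on the stated numbers only; no compression, replication, or pricing assumptions are made.

```python
# Back-of-envelope sizing derived only from the scale bullets above.
raw_ingest_tb_per_day = 6
raw_retention_days = 180
raw_footprint_tb = raw_ingest_tb_per_day * raw_retention_days
print(f"Steady-state raw footprint in S3: ~{raw_footprint_tb} TB (~{raw_footprint_tb / 1024:.2f} PB)")

task_runs_per_day = 1_200
dag_count = 80
print(f"Average task runs per DAG per day: {task_runs_per_day / dag_count:.0f}")

total_events_per_sec = 25_000
consumer_count = 15
print(f"Average load per streaming consumer: ~{total_events_per_sec / consumer_count:,.0f} events/sec")
```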
Requirements
- Design a Docker-based packaging strategy for ETL jobs, Airflow workers, and shared libraries.
- Use Kubernetes to schedule and isolate workloads across dev, staging, and prod (see the pod-per-task sketch below).
- Support both scheduled batch pipelines and continuously running stream consumers (see the consumer sketch below).
- Implement CI/CD for image build, vulnerability scanning, versioning, and rollout.
- Ensure idempotent reruns, backfills, and environment-specific configuration management (see the idempotent load sketch below).
- Add data quality checks before warehouse loads and route failed records for reprocessing (see the data quality sketch below).
- Define monitoring for container health, job failures, resource saturation, and SLA breaches.
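As one possible shape for the Docker and Kubernetes requirements, the pod-per-task sketch below runs a single Airflow task as its own pod on EKS using an image tag pinned by CI. It assumes a recent Airflow 2.x with the apache-airflow-providers-cncf-kubernetes package; the DAG name, namespace, ECR image, and module path are hypothetical placeholders, not part of the brief.

```python
# Minimal sketch: one Airflow task executed as a Kubernetes pod on EKS.
# Assumes Airflow 2.x with the cncf.kubernetes provider; names and image are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="orders_daily_load",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                        # lets the scheduler drive backfills
) as dag:
    extract_orders = KubernetesPodOperator(
        task_id="extract_orders",
        name="extract-orders",
        namespace="etl-prod",            # one namespace per environment (dev/staging/prod)
        # Pin the image to an immutable tag (or digest) produced by CI, never :latest,
        # so every rerun executes exactly the code that was reviewed and scanned.
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/finpulse/etl-orders:1.4.2",
        cmds=["python", "-m", "pipelines.orders.extract"],
        arguments=["--run-date", "{{ ds }}"],   # pass the logical date for idempotent reruns
        env_vars={"ENVIRONMENT": "prod"},
        get_logs=True,
    )
```

Running each task in its own pod keeps Airflow workers thin and lets every job pin its own dependencies inside its image, which addresses the dependency conflicts described in the context.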
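For the idempotent rerun and backfill requirement, one common pattern is a delete-then-insert load keyed by the pipeline's logical date, sketched below. The table names, column names, and DB-API connection (e.g. snowflake-connector-python or psycopg2) are hypothetical; the paramstyle is assumed to be %s.

```python
# Minimal sketch of an idempotent warehouse load keyed by the run's logical date.
# Re-running or backfilling a date overwrites that date's partition instead of duplicating rows.
from datetime import date


def load_orders_partition(conn, run_date: date) -> None:
    with conn.cursor() as cur:
        # Delete-then-insert within one transaction makes a rerun a clean overwrite.
        cur.execute(
            "DELETE FROM analytics.fct_orders WHERE load_date = %s", (run_date,)
        )
        cur.execute(
            """
            INSERT INTO analytics.fct_orders (order_id, amount, load_date)
            SELECT order_id, amount, %s
            FROM staging.orders_raw
            WHERE event_date = %s
            """,
            (run_date, run_date),
        )
    conn.commit()
```

A MERGE into the target table is an equally valid choice; the key point is that the write is keyed by the logical date so reruns and backfills converge to the same result.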
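For the data quality requirement, a minimal gate might split a batch into valid and rejected records and park the rejects under an S3 prefix for later reprocessing. The bucket name, key layout, and validation rules below are illustrative assumptions.

```python
# Minimal sketch of a pre-load data quality gate with reject routing to S3.
import json
from datetime import date

import boto3


def validate(record: dict) -> bool:
    # Example rules only: required keys present and amount is a non-negative number.
    return (
        record.get("order_id") is not None
        and isinstance(record.get("amount"), (int, float))
        and record["amount"] >= 0
    )


def quality_gate(records: list[dict], run_date: date) -> list[dict]:
    valid = [r for r in records if validate(r)]
    rejects = [r for r in records if not validate(r)]

    if rejects:
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="finpulse-dq-rejects",                      # hypothetical bucket
            Key=f"orders/{run_date.isoformat()}/rejects.jsonl",
            Body="\n".join(json.dumps(r) for r in rejects).encode("utf-8"),
        )

    return valid  # only validated records continue to the warehouse load
```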
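For the continuously running consumers and the monitoring requirement, a long-lived consumer (deployed as a Kubernetes Deployment rather than a scheduled job) can expose Prometheus counters so throughput and failures are scrapeable and alertable against the freshness SLA. The broker address, topic, group id, and metric names are hypothetical; the sketch assumes confluent-kafka and prometheus-client.

```python
# Minimal sketch of a long-running stream consumer exposing Prometheus metrics.
from confluent_kafka import Consumer
from prometheus_client import Counter, start_http_server

EVENTS_PROCESSED = Counter("etl_events_processed_total", "Events successfully processed")
EVENTS_FAILED = Counter("etl_events_failed_total", "Events that raised an error")


def process(payload: bytes) -> None:
    ...  # enrichment / transformation logic left out of the sketch


def run() -> None:
    start_http_server(9108)  # /metrics endpoint scraped by the cluster's Prometheus
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",   # hypothetical broker address
        "group.id": "payments-enricher",     # hypothetical consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["payments.events"])

    try:
        while True:  # continuously running Deployment, not a scheduled batch job
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            try:
                process(msg.value())
                EVENTS_PROCESSED.inc()
            except Exception:
                EVENTS_FAILED.inc()          # alert when this rate threatens the freshness SLA
    finally:
        consumer.close()


if __name__ == "__main__":
    run()
```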
Constraints
- Existing stack is AWS-based: EKS, S3, RDS PostgreSQL, Kafka, and Snowflake.
- Team has strong Docker experience but limited Kubernetes operations expertise.
- Incremental platform budget is capped at $18K/month.
- Compliance requires image provenance, secrets management, and audit logs for production deployments.