Context
FinWave, a payments analytics company, runs 120 batch and streaming data pipelines across Apache Airflow, dbt, Spark, and Kafka on AWS. Deployments are manual today: engineers merge code to main, trigger Airflow DAG updates by hand, and promote dbt and Spark changes without automated testing. The result is broken DAGs, schema regressions, and inconsistent production releases.
You need to design a CI/CD platform for data pipelines that standardizes build, test, deployment, rollback, and observability across batch ETL, ELT, and stream processing workloads.
Scale Requirements
- Pipelines: 120 active pipelines, growing to 250 within 12 months
- Deployments: 40-60 production releases/day across DAGs, dbt models, and Spark jobs
- Data volume: 15 TB/day batch + 180K events/sec streaming peak
- Latency: CI feedback < 10 minutes for PR validation; CD promotion to prod < 15 minutes
- Environments: dev, staging, prod across 3 AWS accounts
- Recovery target: roll back or disable a faulty release within 5 minutes
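To put the scale figures above into per-second and per-release terms, here is a rough back-of-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and an even spread of load and releases over 24 hours. Neither assumption comes from the brief, and real traffic is likely burstier. The function names are illustrative only.

```python
# Back-of-envelope figures derived from the stated scale requirements.
# Assumptions (not from the brief): decimal units, load spread evenly over 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def avg_batch_throughput_mb_s(tb_per_day: float) -> float:
    """Average batch rate in MB/s for a daily volume given in TB."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def peak_events_per_day(events_per_sec: int) -> int:
    """Upper bound on daily events if the streaming peak were sustained all day."""
    return events_per_sec * SECONDS_PER_DAY


def avg_minutes_between_releases(releases_per_day: int) -> float:
    """Average gap between production releases, assuming an even 24-hour spread."""
    return MINUTES_PER_DAY / releases_per_day


print(round(avg_batch_throughput_mb_s(15), 1))  # 173.6 MB/s average batch rate
print(peak_events_per_day(180_000))             # 15552000000 events/day upper bound
print(avg_minutes_between_releases(60))         # 24.0 minutes between releases
print(avg_minutes_between_releases(40))         # 36.0 minutes between releases
```

Under these assumptions, a release lands every 24 to 36 minutes on average, so a 15-minute promotion window leaves limited slack once releases cluster during working hours.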
Requirements
- Design a CI pipeline that validates Python, SQL, DAG definitions, dbt models, and Spark jobs before merge.
- Implement automated unit, integration, schema-compatibility, and data-quality tests.
- Design CD workflows for Airflow DAG deployment, dbt model promotion, and Spark job release with environment-specific configs.
- Ensure idempotent deployments, versioned artifacts, and reproducible builds.
- Support safe rollout strategies for streaming jobs and backward-compatible schema evolution.
- Include secrets management, approval gates for production, and rollback mechanisms.
- Define monitoring for deployment health, pipeline failures, test flakiness, and post-release data incidents.
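As one illustration of the schema-compatibility requirement above, the following is a minimal sketch of a check that a CI step could run against a proposed schema change. The rule set is deliberately conservative and chosen for illustration: existing fields may not be removed or change type, and new fields must be nullable. It is not the rule set of any particular schema registry, and the `Field` type and `find_breaking_changes` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field: a type name plus a nullability flag."""
    type: str
    nullable: bool = False


def find_breaking_changes(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """Return human-readable reasons why `new` would break readers of `old`."""
    problems = []
    for name, old_field in old.items():
        new_field = new.get(name)
        if new_field is None:
            problems.append(f"removed field: {name}")
        elif new_field.type != old_field.type:
            problems.append(
                f"type changed: {name} ({old_field.type} -> {new_field.type})"
            )
    for name, new_field in new.items():
        # A new required field has no value in records written before the change.
        if name not in old and not new_field.nullable:
            problems.append(f"new required field: {name}")
    return problems


old = {"txn_id": Field("string"), "amount": Field("decimal")}
compatible = {**old, "currency": Field("string", nullable=True)}
breaking = {
    "txn_id": Field("string"),
    "amount": Field("float"),
    "merchant": Field("string"),
}

print(find_breaking_changes(old, compatible))  # []
print(find_breaking_changes(old, breaking))
# ['type changed: amount (decimal -> float)', 'new required field: merchant']
```

A CI job could fail the pull request whenever the returned list is non-empty, which keeps the decision auditable because each rejection carries its reason.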
Constraints
- AWS-first stack; existing services include EKS, MWAA, S3, EMR, MSK (managed Kafka), and Snowflake
- Team of 6 data engineers and 1 platform engineer; solution must minimize operational overhead
- Monthly platform budget increase capped at $18K
- SOX and PCI controls require audit logs, approval history, and restricted production access
- Some pipelines are stateful streaming jobs and cannot tolerate duplicate processing during redeployments
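The last constraint implies that a redeployed streaming job must resume exactly where the previous version stopped. One common way to reason about this is to save the job state and the next offset to read as a single unit, so they can never drift apart. The sketch below simulates that idea in plain Python. The `Checkpoint` and `run_job` names are hypothetical, and the real mechanism (for example, a framework's checkpoint or savepoint feature) is not specified in the brief.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    """Job state and the next offset to read, saved together as one unit."""
    next_offset: int = 0
    running_total: int = 0


def run_job(
    events: List[int],
    checkpoint: Checkpoint,
    stop_before: Optional[int] = None,
) -> Checkpoint:
    """Process events from the checkpointed offset onward.

    `stop_before` simulates draining the old version ahead of a redeploy.
    """
    end = len(events) if stop_before is None else min(stop_before, len(events))
    for offset in range(checkpoint.next_offset, end):
        # Build a new checkpoint so state and offset always advance together.
        checkpoint = Checkpoint(
            next_offset=offset + 1,
            running_total=checkpoint.running_total + events[offset],
        )
    return checkpoint


events = [5, 10, 20, 40]

# Old version is drained after two events, then the new version resumes.
saved = run_job(events, Checkpoint(), stop_before=2)
final = run_job(events, saved)

print(saved)  # Checkpoint(next_offset=2, running_total=15)
print(final)  # Checkpoint(next_offset=4, running_total=75)
```

Because the new version starts from `saved.next_offset`, each event contributes to the total exactly once. Rerunning `run_job(events, final)` returns the same checkpoint, which is the idempotence property the deployment requirements ask for.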