Context
FinEdge, a B2B payments company, runs 120 production data pipelines that ingest PostgreSQL CDC, S3 batch files, and Kafka event streams into Snowflake. Today, deployments are manual: Airflow DAGs are copied into production, dbt models are promoted without automated tests, and infrastructure changes are applied ad hoc. The result is broken dependencies, schema drift, and difficult rollbacks.
You need to design a CI/CD platform for the data engineering team that standardizes build, test, deploy, and rollback for pipelines across dev, staging, and prod. The goal is to reduce failed releases while preserving data correctness and deployment speed.
Scale Requirements
- Pipelines: 120 active pipelines, growing to 250 within 12 months
- Deploy frequency: 40-60 merges/day across DAGs, dbt models, and Terraform
- Data volume: 8 TB/day batch + 150K events/sec streaming peak
- SLA impact: Critical pipelines must recover from bad deploys in < 15 minutes
- Environments: 3 isolated environments (dev, staging, prod), each with a separate Airflow deployment, Snowflake databases, and AWS account
- Test runtime target: PR validation < 12 minutes, full staging validation < 30 minutes
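To get a feel for what these figures imply, the following is a rough back-of-envelope sketch. It uses only the numbers listed above. Two simplifications are assumptions, not part of the requirements: every merge triggers exactly one PR validation run at the full 12-minute target, and the streaming peak is treated as sustained all day to give an upper bound. The function names are illustrative.

```python
# Back-of-envelope figures derived from the scale requirements above.
MERGES_PER_DAY = (40, 60)            # stated deploy frequency range
PR_VALIDATION_MIN = 12               # stated PR validation target (minutes)
BATCH_TB_PER_DAY = 8                 # stated batch volume
STREAM_PEAK_EVENTS_PER_SEC = 150_000 # stated streaming peak
SECONDS_PER_DAY = 86_400


def ci_minutes_per_day(merges, minutes_per_run=PR_VALIDATION_MIN):
    """CI minutes per day, ASSUMING one full-length validation run per merge."""
    return merges * minutes_per_run


def avg_batch_mb_per_sec(tb_per_day=BATCH_TB_PER_DAY):
    """Average batch ingest rate in decimal units (1 TB = 1,000,000 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def max_stream_events_per_day(events_per_sec=STREAM_PEAK_EVENTS_PER_SEC):
    """Upper bound on daily events, ASSUMING the peak rate held all day."""
    return events_per_sec * SECONDS_PER_DAY


low, high = (ci_minutes_per_day(m) for m in MERGES_PER_DAY)
print(f"CI minutes/day: {low}-{high}")                        # 480-720
print(f"Average batch rate: {avg_batch_mb_per_sec():.1f} MB/s")  # 92.6 MB/s
print(f"Stream upper bound: {max_stream_events_per_day():,} events/day")
```

In practice a PR usually triggers several validation runs before it merges, so the real CI load is likely higher than this lower-bound estimate.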
Requirements
- Design CI/CD for Airflow DAGs, dbt transformations, Spark jobs, and Terraform-managed infrastructure.
- Include automated checks for linting, unit tests, SQL tests, schema compatibility, data quality, and security scanning.
- Support environment promotion from dev to staging to prod with approval gates for critical pipelines.
- Ensure idempotent deployments, versioned artifacts, and reproducible rollback for code and infrastructure.
- Prevent bad releases from corrupting production tables or breaking downstream dependencies.
- Define how streaming and batch pipelines are deployed differently, including zero-downtime or low-risk rollout patterns.
- Include observability for deployment health, pipeline freshness, and post-deploy data validation.
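As one illustration of the schema compatibility check named in the requirements, here is a minimal sketch of a gate that flags breaking changes between a table's current schema and a proposed one. The compatibility rule used here (dropped columns and type changes are breaking, added columns are not) is an assumption chosen for the example, and the column names are hypothetical. A real check would read schemas from Snowflake or the dbt manifest rather than from literal dictionaries.

```python
from typing import Dict, List

# Column name -> type name. A deliberately simple stand-in for a real schema.
Schema = Dict[str, str]


def breaking_changes(current: Schema, proposed: Schema) -> List[str]:
    """Return a description of each change that could break downstream readers.

    ASSUMED rule: dropping a column or changing its type is breaking;
    adding a new column is treated as compatible.
    """
    problems = []
    for column, column_type in current.items():
        if column not in proposed:
            problems.append(f"column dropped: {column}")
        elif proposed[column] != column_type:
            problems.append(
                f"type changed: {column} {column_type} -> {proposed[column]}"
            )
    return problems


# Hypothetical table used only for illustration.
current = {"payment_id": "VARCHAR", "amount": "NUMBER", "created_at": "TIMESTAMP"}
proposed = {"payment_id": "VARCHAR", "amount": "VARCHAR", "currency": "VARCHAR"}

for problem in breaking_changes(current, proposed):
    print(problem)
# type changed: amount NUMBER -> VARCHAR
# column dropped: created_at
```

A CI job could fail the PR whenever the returned list is non-empty, which is one way to keep a bad release from breaking downstream dependencies.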
Constraints
- Existing stack is AWS, GitHub, Airflow 2.x, dbt Core, Spark on EMR, Snowflake, Terraform.
- Team has 6 data engineers and 1 platform engineer; operational complexity should stay moderate.
- SOX-style auditability is required: every production change must be traceable to a PR, test run, approver, and artifact version.
- Incremental platform budget is capped at $18K/month.
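The auditability constraint names four things every production change must trace to: a PR, a test run, an approver, and an artifact version. A minimal sketch of how a deploy step might enforce that is shown below. The class name, field names, and sample values are hypothetical, and where the record is stored is left open.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """The four audit links required for every production change."""
    pr: Optional[str]
    test_run: Optional[str]
    approver: Optional[str]
    artifact_version: Optional[str]


def missing_audit_fields(record: ChangeRecord) -> List[str]:
    """Return the names of audit fields that are absent or empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical record with no approver recorded.
record = ChangeRecord(
    pr="example-pr",
    test_run="example-test-run",
    approver=None,
    artifact_version="example-version",
)

gaps = missing_audit_fields(record)
if gaps:
    print(f"Blocking deploy, missing audit fields: {gaps}")
# Blocking deploy, missing audit fields: ['approver']
```

Refusing to deploy when any field is missing keeps the audit trail complete by construction instead of relying on after-the-fact review.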