Context
Northstar Health runs 120+ batch and streaming data pipelines that move data from PostgreSQL, Kafka, and S3 into Snowflake using Airflow, dbt, and Spark. Deployments today are manual: engineers merge to main, update DAGs directly, and run ad hoc validation scripts. The result is broken dependencies, inconsistent environments, and slow rollback during incidents.
Your task is to design a CI/CD pipeline for the data engineering team that standardizes build, test, deploy, and rollback for pipeline code, SQL models, and infrastructure changes.
Scale Requirements
- Repositories: 8 repos, ~250 deployable assets (Airflow DAGs, dbt models, Spark jobs, Terraform)
- Change volume: ~40 PRs/day, 8-12 production deployments/day
- Data scale: 15 TB/day processed, 30K Airflow task runs/day
- CI feedback target: < 10 minutes for standard PRs
- Promotion target: Staging to production in < 15 minutes after approval
- Reliability target: Failed deployment rate < 2%, rollback initiation < 5 minutes
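To make the targets above concrete, here is a minimal sketch that encodes them as strict upper bounds and converts the failed-deployment rate into an absolute count. The class name, function names, and metric keys are hypothetical, chosen only for illustration; the numeric limits come from the list above, and the 30-day window is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Limits taken from the scale requirements; all are strict upper bounds."""
    ci_feedback_minutes: int = 10
    promotion_minutes: int = 15
    rollback_initiation_minutes: int = 5
    failed_deploy_pct: int = 2


def failed_deploy_budget(deploys_per_day: int, days: int,
                         targets: Targets = Targets()) -> int:
    """Largest failure count that keeps the failure rate strictly below target.

    Integer arithmetic avoids float rounding: we need
    failures * 100 < total * failed_deploy_pct.
    """
    total = deploys_per_day * days
    return max((total * targets.failed_deploy_pct - 1) // 100, 0)


def violations(observed: dict, targets: Targets = Targets()) -> list:
    """Return the names of observed metrics that meet or exceed their limit."""
    limits = {
        "ci_feedback_minutes": targets.ci_feedback_minutes,
        "promotion_minutes": targets.promotion_minutes,
        "rollback_initiation_minutes": targets.rollback_initiation_minutes,
        "failed_deploy_pct": targets.failed_deploy_pct,
    }
    return [name for name, limit in limits.items() if observed[name] >= limit]


print(failed_deploy_budget(8, 30))   # 4
print(failed_deploy_budget(12, 30))  # 7
```

At 8-12 production deployments per day, a 30-day window holds 240-360 deployments, so staying under 2% leaves room for at most 4 to 7 failed deployments in that window.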
Requirements
- Build a CI pipeline that validates Python, SQL, dbt, Spark, and Terraform changes on every pull request.
- Enforce automated tests: unit tests, DAG import tests, dbt compile/tests, schema checks, and data quality gates.
- Support environment promotion across dev → staging → prod with approval gates and artifact versioning.
- Ensure idempotent deployments for Airflow DAGs, dbt projects, containerized Spark jobs, and infrastructure.
- Detect dependency issues before release, including upstream schema changes and broken DAG references.
- Provide rollback mechanisms for failed deployments without corrupting production data or incorrectly rerunning completed loads.
- Include monitoring, auditability, and deployment metadata for compliance reviews.
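Several of the requirements above (promotion with approval gates, artifact versioning, idempotent deployment, rollback, and deployment metadata) can be illustrated with one small in-memory model. This is a sketch under stated assumptions, not a prescribed design: the `Releases` class, its method names, and the rule that an artifact must be the current release in the preceding environment are all hypothetical. It tracks only which code version is live; it does not address data-side rollback or the handling of completed loads.

```python
import hashlib
from dataclasses import dataclass, field

ENVIRONMENTS = ("dev", "staging", "prod")  # promotion order from the requirements


def artifact_digest(payload: bytes) -> str:
    """Content-addressed version: identical bytes always yield the same version."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class Releases:
    """Hypothetical in-memory release tracker, for illustration only."""
    history: dict = field(default_factory=lambda: {e: [] for e in ENVIRONMENTS})
    audit_log: list = field(default_factory=list)

    def current(self, env: str):
        """Return the live digest for env, or None if nothing is deployed."""
        return self.history[env][-1] if self.history[env] else None

    def deploy(self, env: str, digest: str, actor: str, approved_by=None) -> bool:
        """Deploy digest to env; return False if it is already live (no-op)."""
        idx = ENVIRONMENTS.index(env)
        # Assumed gate: the artifact must be live in the preceding environment.
        if idx > 0 and self.current(ENVIRONMENTS[idx - 1]) != digest:
            raise ValueError(f"{digest} is not live in {ENVIRONMENTS[idx - 1]}")
        # Assumed gate: production requires a named approver.
        if env == "prod" and approved_by is None:
            raise PermissionError("prod deployment requires approval")
        if self.current(env) == digest:
            return False  # idempotent: redeploying the same version changes nothing
        self.history[env].append(digest)
        self._record("deploy", env, digest, actor, approved_by)
        return True

    def rollback(self, env: str, actor: str) -> str:
        """Re-point env at the previously deployed digest and return it."""
        if len(self.history[env]) < 2:
            raise ValueError(f"no earlier release to roll back to in {env}")
        self.history[env].pop()
        previous = self.history[env][-1]
        self._record("rollback", env, previous, actor, None)
        return previous

    def _record(self, action, env, digest, actor, approved_by):
        """Append deployment metadata for later audit review."""
        self.audit_log.append({"action": action, "env": env, "digest": digest,
                               "actor": actor, "approved_by": approved_by})
```

A short usage example: after `r = Releases()` and `v1 = artifact_digest(b"dags-v1")`, calling `r.deploy("dev", v1, "alice")` returns `True`, while calling it a second time returns `False` and writes no new audit record. Content-addressing the artifact is what makes the repeat call safely detectable as a no-op.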
Constraints
- AWS-first stack; existing services include ECR, MWAA, EMR, S3, Snowflake, and Terraform Cloud.
- Team size is 5 data engineers and 1 platform engineer; operational complexity should stay moderate.
- HIPAA-sensitive datasets require audit logs, least-privilege access, and separation of lower and production environments.
- Monthly incremental platform budget is capped at $18K.
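The least-privilege and environment-separation constraint can be checked mechanically. The sketch below assumes a simple mapping from deployment roles to the environments they may target; the role names and both function names are hypothetical, and a real setup would derive this mapping from IAM and Snowflake grants rather than a hard-coded dictionary.

```python
LOWER_ENVIRONMENTS = frozenset({"dev", "staging"})
PRODUCTION = "prod"

# Hypothetical role-to-environment grants, for illustration only.
ROLE_ENVIRONMENTS = {
    "ci-pr-validator": frozenset(),                     # runs checks, deploys nowhere
    "deployer-lower": frozenset({"dev", "staging"}),
    "deployer-prod": frozenset({"prod"}),
}


def can_deploy(role: str, environment: str, grants: dict = ROLE_ENVIRONMENTS) -> bool:
    """Deny by default: unknown roles and unlisted environments are rejected."""
    return environment in grants.get(role, frozenset())


def roles_crossing_boundary(grants: dict = ROLE_ENVIRONMENTS) -> list:
    """Return roles that can reach both a lower environment and production."""
    return sorted(
        role for role, envs in grants.items()
        if envs & LOWER_ENVIRONMENTS and PRODUCTION in envs
    )
```

Running `roles_crossing_boundary()` in CI and failing the build when it returns a non-empty list is one low-overhead way for a six-person team to keep the separation from eroding over time.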