Context
Northstar Retail runs 150+ daily and hourly ETL/ELT jobs that move order, inventory, and customer data from PostgreSQL, Kafka, and S3 into Snowflake. The current deployment process uses a single Airflow environment with manual approvals and direct production releases, causing frequent incidents during weekly schema, dbt, and DAG changes.
You are asked to redesign the deployment pipeline so the team can ship multiple changes per day while reducing release risk, rollback time, and data quality regressions.
Scale Requirements
- Pipelines: 150+ Airflow DAGs, 40 dbt models, 25 Spark batch jobs
- Data volume: 12 TB/day across batch and streaming sources
- Release frequency target: 10-20 production deployments/day
- Freshness SLO: 95% of critical tables updated within 15 minutes of their scheduled time
- Recovery target: rollback or forward-fix within 10 minutes
- Retention: 180 days raw in S3, 3 years curated in Snowflake
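The freshness SLO above can be made concrete with a small check. The following is a minimal sketch, assuming compliance is measured per scheduled update of a critical table and that a table that never refreshed counts as a miss; the `TableUpdate` record and the function names are hypothetical, not part of the existing stack.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

FRESHNESS_WINDOW = timedelta(minutes=15)  # from the stated SLO
SLO_TARGET = 0.95                         # from the stated SLO


@dataclass(frozen=True)
class TableUpdate:
    """One scheduled refresh of a critical table (hypothetical record shape)."""
    table: str
    scheduled_at: datetime
    updated_at: Optional[datetime]  # None means the refresh never landed


def is_fresh(update: TableUpdate) -> bool:
    """True if the table refreshed no later than 15 minutes after its schedule."""
    if update.updated_at is None:
        return False
    return update.updated_at - update.scheduled_at <= FRESHNESS_WINDOW


def freshness_compliance(updates: Sequence[TableUpdate]) -> float:
    """Fraction of scheduled updates that met the freshness window."""
    if not updates:
        return 1.0  # assumption: nothing scheduled means nothing violated
    return sum(is_fresh(u) for u in updates) / len(updates)


def meets_slo(updates: Sequence[TableUpdate]) -> bool:
    """True if at least 95% of the scheduled updates were fresh."""
    return freshness_compliance(updates) >= SLO_TARGET
```

Under these assumptions, 19 on-time updates out of 20 meets the target, while 18 out of 20 does not. A real check would read the schedule and refresh timestamps from Airflow and Snowflake metadata rather than from in-memory records.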
Requirements
- Design a CI/CD pipeline for Airflow DAGs, dbt models, Spark jobs, and infrastructure-as-code.
- Add automated validation before production deploys: unit tests, schema checks, SQL linting, data contract validation, and integration tests against staging data.
- Support low-risk rollout patterns such as canary DAGs, blue/green Airflow environments, and feature flags for new transformations.
- Ensure deployments are idempotent and safe for backfills, retries, and partial failures.
- Define how to detect regressions in freshness, row counts, null rates, and duplicate rates immediately after release.
- Provide a rollback strategy for code, configuration, and data artifacts.
- Explain how secrets, environment promotion, and artifact versioning are handled.
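One way to approach the post-release regression requirement is to compare per-table metrics captured before and after a deploy. The sketch below is illustrative only: the `TableMetrics` shape, the threshold values, and the function name are assumptions chosen for the example, and a real design would tune thresholds per table and pull metrics from Snowflake.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TableMetrics:
    """Snapshot of one table's health metrics (hypothetical shape)."""
    row_count: int
    null_rate: float          # fraction of nulls in a monitored column, 0..1
    duplicate_rate: float     # fraction of duplicated keys, 0..1
    freshness_lag_min: float  # minutes between schedule and last update


# Illustrative thresholds; these values are assumptions, not requirements.
MAX_ROW_COUNT_DRIFT = 0.10           # relative change versus baseline
MAX_NULL_RATE_INCREASE = 0.01        # absolute increase
MAX_DUPLICATE_RATE_INCREASE = 0.001  # absolute increase
MAX_FRESHNESS_LAG_MIN = 15.0         # matches the 15-minute freshness SLO


def detect_regressions(baseline: TableMetrics, current: TableMetrics) -> List[str]:
    """Return the names of metrics that regressed after a release."""
    findings: List[str] = []

    # Row count: flag a large relative swing in either direction.
    if baseline.row_count > 0:
        drift = abs(current.row_count - baseline.row_count) / baseline.row_count
        if drift > MAX_ROW_COUNT_DRIFT:
            findings.append("row_count")
    elif current.row_count > 0:
        findings.append("row_count")  # table was empty before, is not now

    # Null and duplicate rates: flag increases only, since decreases are benign.
    if current.null_rate - baseline.null_rate > MAX_NULL_RATE_INCREASE:
        findings.append("null_rate")
    if current.duplicate_rate - baseline.duplicate_rate > MAX_DUPLICATE_RATE_INCREASE:
        findings.append("duplicate_rate")

    # Freshness: compared against an absolute bound, not the baseline.
    if current.freshness_lag_min > MAX_FRESHNESS_LAG_MIN:
        findings.append("freshness")

    return findings
```

A non-empty result could gate promotion of a canary DAG or trigger the rollback path. Comparing against a pre-deploy baseline rather than fixed absolute values is a design choice that keeps the check usable across tables of very different sizes.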
Constraints
- AWS is the required cloud; the current stack already uses Airflow 2.x, dbt Core, EMR Spark, Snowflake, and Terraform.
- Team size is 5 data engineers and 1 platform engineer; operational complexity should stay moderate.
- Production data cannot be copied freely into lower environments due to PCI controls; use masked subsets or synthetic data.
- The incremental cloud/tooling budget should stay under $15K/month.
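The masked-subset constraint could be approached with keyed hashing, so that the same input always maps to the same token and joins across tables still line up in staging. This is a minimal sketch under stated assumptions: the field names, the key handling, and the sampling rule are all hypothetical, and whether keyed hashing is acceptable for a given field under PCI is a decision for compliance review rather than something this example settles.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable, List, Set


def mask_value(value: str, key: bytes) -> str:
    """Replace a sensitive value with a deterministic keyed-hash token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; length is an arbitrary choice


def in_subset(record_id: str, key: bytes, percent: int) -> bool:
    """Deterministically select roughly `percent` percent of record ids."""
    digest = hmac.new(key, record_id.encode("utf-8"), hashlib.sha256).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def masked_subset(
    rows: Iterable[Dict[str, Any]],
    key: bytes,
    id_field: str,
    sensitive_fields: Set[str],
    percent: int,
) -> List[Dict[str, Any]]:
    """Keep a stable sample of rows and mask the sensitive fields in each."""
    result = []
    for row in rows:
        if not in_subset(str(row[id_field]), key, percent):
            continue
        result.append({
            name: mask_value(str(value), key) if name in sensitive_fields else value
            for name, value in row.items()
        })
    return result
```

In use, the key would come from a secrets store rather than source code, and the same key would be applied to every table so that a masked `customer_id` (a hypothetical column) matches across orders and customers. Sampling by hash of the id, rather than randomly, keeps the subset stable between refreshes.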