Context
Northstar Retail runs 120 batch and near-real-time data pipelines on AWS using Apache Airflow, dbt, Spark, and Snowflake. Today, DAGs and transformation code are deployed manually from developer laptops, which produces inconsistent environments and failed releases and leaves no reliable rollback path.
You need to design a CI/CD pipeline for the data platform so code changes can be validated, tested, deployed, and monitored safely across dev, staging, and production.
Scale Requirements
- Pipelines managed: 120 Airflow DAGs, 450 dbt models, 35 Spark jobs
- Deploy frequency: 20-30 merges/day from 8 contributing engineers
- Latency target: CI feedback in < 10 minutes for standard changes; production deploy < 15 minutes
- Data volume affected: ~14 TB/day across S3 and Snowflake
- Availability target: 99.9% successful scheduled runs after deployment
- Rollback target: Restore previous stable release in < 5 minutes
Requirements
- Design a CI pipeline that validates Python, SQL, DAG definitions, dbt models, and infrastructure changes on every pull request (a DAG validation sketch follows this list).
- Include automated unit, integration, and data quality tests before promotion to production (a staging smoke test is sketched below).
- Support environment-specific configuration and secrets management without hardcoding credentials (a Secrets Manager sketch follows).
- Ensure deployments are idempotent and prevent partial releases across Airflow, dbt, and Spark assets (a pointer-flip release sketch follows).
- Define a release strategy for dev, staging, and production, including approvals and rollback (the same pointer-flip sketch covers rollback).
- Add monitoring for deployment health, pipeline failures, and post-release data quality regressions (a CloudWatch metric sketch follows).
- Explain how you would handle schema changes, backfills, and breaking DAG updates safely (a chunked backfill sketch follows).
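For the pull-request gate on DAG definitions, a minimal pytest sketch using Airflow's DagBag is shown below. The dags/ folder path and the owner/retries conventions are assumptions for illustration, not part of the brief.

```python
# Minimal sketch of a CI gate that fails the pull request when any DAG
# cannot be imported or violates basic conventions. Assumes DAG files
# live under dags/ in the repository checkout.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # include_examples=False keeps Airflow's bundled example DAGs out of CI.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error or missing dependency surfaces here, before deploy.
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"


def test_every_dag_has_owner_and_retries(dag_bag):
    # Conventions worth enforcing at review time rather than in production.
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} is missing an owner"
        assert dag.default_args.get("retries", 0) >= 1, f"{dag_id} has no retries"
```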
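For the data quality gate before promotion, one option is a smoke test that runs against the staging database right after a deploy. The table, column, and warehouse names below are hypothetical, and credentials are read from environment variables injected by the CI runner, never hardcoded.

```python
# Hypothetical post-deploy smoke test against staging Snowflake. Table
# names (analytics.fct_orders) and thresholds are illustrative only.
import os
import snowflake.connector


def test_orders_freshness_and_nulls():
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="CI_WH",
        database="STAGING",
    )
    try:
        cur = conn.cursor()
        # Freshness: the model must have loaded data within the last day.
        cur.execute(
            "SELECT DATEDIFF('hour', MAX(loaded_at), CURRENT_TIMESTAMP()) "
            "FROM analytics.fct_orders"
        )
        staleness_hours = cur.fetchone()[0]
        assert staleness_hours is not None and staleness_hours < 24

        # Null-rate regression: the primary key must never be null.
        cur.execute("SELECT COUNT(*) FROM analytics.fct_orders WHERE order_id IS NULL")
        assert cur.fetchone()[0] == 0
    finally:
        conn.close()
```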
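For environment-specific secrets on this stack, a common pattern is to resolve credentials from AWS Secrets Manager at runtime so nothing sensitive lives in the repo or a DAG file. The northstar/&lt;env&gt;/snowflake naming convention is an assumption for illustration.

```python
# Sketch of environment-aware secret retrieval via AWS Secrets Manager.
# The secret name convention is hypothetical; Secrets Manager encrypts
# values at rest and boto3 fetches them over TLS, matching the PII constraint.
import json
import boto3


def get_snowflake_credentials(env: str) -> dict:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=f"northstar/{env}/snowflake")
    # SecretString holds a JSON document: {"user": ..., "password": ..., "account": ...}
    return json.loads(response["SecretString"])


# Usage inside a deployment script or Airflow connection setup:
# creds = get_snowflake_credentials("staging")
```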
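For idempotent, all-or-nothing releases and the &lt; 5 minute rollback target, one sketch is an immutable-bundle-plus-pointer scheme: every merge publishes a bundle keyed by git SHA, promotion is a single S3 pointer write, and rollback is the same write with the previous SHA. Bucket and key names below are assumptions.

```python
# Sketch of an atomic release: schedulers and jobs resolve pointers/<env>/current
# at start-up, so switching releases is one object write, never a partial copy.
import boto3

BUCKET = "northstar-data-releases"  # hypothetical bucket name
s3 = boto3.client("s3")


def publish_bundle(git_sha: str, bundle_path: str) -> None:
    # Re-uploading the same SHA has the same end state, keeping deploys idempotent.
    s3.upload_file(bundle_path, BUCKET, f"bundles/{git_sha}/release.tar.gz")


def promote(env: str, git_sha: str) -> None:
    # Record the outgoing release first so rollback always has a target.
    current = _read_pointer(env)
    if current:
        s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/previous", Body=current.encode())
    # The atomic switch: one write moves Airflow, dbt, and Spark together.
    s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/current", Body=git_sha.encode())


def rollback(env: str) -> None:
    previous = _read_pointer(env, "previous")
    if previous:
        s3.put_object(Bucket=BUCKET, Key=f"pointers/{env}/current", Body=previous.encode())


def _read_pointer(env: str, name: str = "current") -> str | None:
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"pointers/{env}/{name}")
        return obj["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        return None
```

Because the switch is a single object write, a failed deploy never leaves Airflow, dbt, and Spark assets on mixed versions, and the pointer history doubles as an audit trail for the SOX constraint.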
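For deployment-health monitoring, the deploy job can emit a custom CloudWatch metric that an alarm then watches against the 99.9% scheduled-run target. The namespace and dimension names here are illustrative.

```python
# Sketch of post-release health reporting: emit one data point per deploy,
# then alarm on the metric. Namespace/dimensions are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")


def report_deploy_health(env: str, release_sha: str, succeeded: bool) -> None:
    cloudwatch.put_metric_data(
        Namespace="Northstar/DataPlatform",
        MetricData=[
            {
                "MetricName": "DeploymentSuccess",
                "Dimensions": [
                    {"Name": "Environment", "Value": env},
                    {"Name": "Release", "Value": release_sha},
                ],
                "Value": 1.0 if succeeded else 0.0,
                "Unit": "Count",
            }
        ],
    )
```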
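For backfills after a schema change, a chunked driver keeps long reprocessing windows restartable: instead of one multi-week backfill that can fail halfway, each day-sized window runs to completion before the next starts. The sketch assumes the Airflow CLI is available on the host running it; the DAG id in the usage line is hypothetical.

```python
# Sketch of a chunked backfill driver: a failure stops at a clean window
# boundary, so rerunning the script resumes from the failed window.
import subprocess
from datetime import date, timedelta


def backfill_in_chunks(dag_id: str, start: date, end: date, chunk_days: int = 1) -> None:
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=chunk_days - 1), end)
        subprocess.run(
            [
                "airflow", "dags", "backfill",
                "--start-date", cursor.isoformat(),
                "--end-date", window_end.isoformat(),
                dag_id,
            ],
            check=True,  # stop at the first failed window
        )
        cursor = window_end + timedelta(days=1)


# Example: backfill_in_chunks("orders_daily", date(2024, 1, 1), date(2024, 1, 31))
```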
Constraints
- Existing stack must remain AWS + Snowflake; no full platform rewrite
- The team owning the platform is small: 3 data engineers and 1 platform engineer
- Monthly incremental tooling budget is capped at $12K
- Production changes must be auditable for SOX compliance
- Some pipelines process PII, so secrets and artifacts must be encrypted at rest and in transit