Context
Northstar Retail runs 120+ batch and streaming data pipelines that move data from PostgreSQL, Kafka, and S3 into Snowflake for analytics and operational reporting. The current deployment process is manual: engineers merge to main, GitHub Actions builds Docker images, and Airflow DAGs, dbt models, and Spark jobs are deployed directly into production with limited validation. The result has been broken DAGs, schema-drift incidents, and slow rollbacks.
You are asked to walk through the team’s current deployment pipeline and redesign it for safer, faster, and more observable releases.
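For concreteness, the current flow might look roughly like the following GitHub Actions workflow. This is a hypothetical sketch: the file path, job names, and deploy script are assumptions, not taken from the brief.

```yaml
# .github/workflows/deploy.yml (illustrative only)
name: deploy
on:
  push:
    branches: [main]          # every merge to main goes straight to prod
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t pipelines:${{ github.sha }} .
      - name: Deploy to production    # no staging, no validation gates
        run: ./scripts/deploy_prod.sh ${{ github.sha }}   # assumed script
```

Note what is missing: no test stage, no staging promotion, no rollback path. Those gaps are what the redesign should close.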
Scale Requirements
- Pipelines: 120 active pipelines, 35 Airflow DAGs, 20 Spark jobs, 200+ dbt models
- Data volume: 8 TB/day batch + 150K Kafka events/sec peak
- Deployment frequency: 20-30 production releases/day across data platform repos
- Latency targets: batch SLA < 45 minutes; streaming freshness < 3 minutes
- Reliability target: 99.9% successful scheduled runs per month
- Recovery target: rollback or hotfix within 15 minutes
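The 99.9% target translates into a small error budget. A back-of-envelope calculation (assuming, for illustration, one scheduled run per pipeline per day; real schedules are denser, which only tightens the budget):

```python
# Error budget implied by the 99.9% monthly reliability target.
# Assumption (not from the brief): one scheduled run per pipeline per day.
PIPELINES = 120
RUNS_PER_DAY = 1
DAYS = 30
SUCCESS_TARGET = 0.999

total_runs = PIPELINES * RUNS_PER_DAY * DAYS           # 3600 runs/month
allowed_failures = total_runs * (1 - SUCCESS_TARGET)   # ~3.6 failed runs

print(f"{total_runs} scheduled runs/month -> "
      f"budget of ~{allowed_failures:.1f} failed runs")
```

Roughly three to four failed runs per month across the whole platform, which is why every release needs validation gates rather than post-hoc firefighting.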
Requirements
- Design a deployment pipeline for Airflow DAGs, dbt transformations, and Spark jobs with clear promotion from dev to staging to production.
- Add automated validation: unit tests, SQL tests, schema compatibility checks, data quality gates, and infrastructure policy checks.
- Support idempotent deployments, versioned artifacts, and rollback to the last known-good release.
- Prevent bad releases from impacting downstream SLAs or corrupting warehouse tables.
- Define how secrets, environment-specific configs, and infrastructure changes are managed.
- Include monitoring for deployment health, pipeline health, and post-deploy regressions.
- Explain how you would improve developer velocity without weakening controls.
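The idempotency and rollback requirements above can be sketched as a minimal release tracker. This is an illustrative data model under assumed names (`Release`, `ReleaseTracker` are not from the brief); a real implementation would persist history in a durable store and drive Airflow/dbt/Spark redeploys from it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Release:
    version: str        # immutable artifact tag, e.g. a git SHA
    image: str          # pinned Docker image reference
    dbt_manifest: str   # versioned dbt artifact for this release

class ReleaseTracker:
    """Tracks deployed releases so rollback targets the last known-good."""

    def __init__(self) -> None:
        self._history: list[Release] = []

    def deploy(self, release: Release) -> Release:
        # Idempotent: re-deploying the current version is a no-op.
        if self._history and self._history[-1].version == release.version:
            return release
        self._history.append(release)
        return release

    def rollback(self) -> Release:
        """Discard the bad head and return the previous known-good release."""
        if len(self._history) < 2:
            raise RuntimeError("no known-good release to roll back to")
        self._history.pop()
        return self._history[-1]
```

Because artifacts are versioned and immutable, rollback is just redeploying an existing tag, which is what makes a 15-minute recovery target realistic.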
Constraints
- AWS-first environment with existing EKS, S3, Snowflake, Airflow 2.x, dbt Core, and Spark on EMR
- Small platform team: 5 data engineers, 1 platform engineer
- Incremental budget increase capped at $15K/month
- SOX-style change controls for finance datasets; all production changes must be auditable
- Some legacy DAGs cannot be rewritten immediately and must coexist with the new process
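One of the required validation gates, the schema compatibility check, can be sketched as a simple additive-only rule: new columns are allowed, but drops and type changes block the deploy because they can break downstream consumers and corrupt warehouse tables. The function name and the type strings are illustrative, not a specific Snowflake API.

```python
def schema_compatible(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return a list of breaking changes; an empty list means safe to deploy.

    Additive columns are allowed; dropped columns and type changes are
    flagged so the CI gate can fail the release before it reaches prod.
    """
    problems: list[str] = []
    for col, typ in old.items():
        if col not in new:
            problems.append(f"dropped column: {col}")
        elif new[col] != typ:
            problems.append(f"type change on {col}: {typ} -> {new[col]}")
    return problems
```

In CI this would compare the proposed model schema against the schema of the deployed table (or a stored contract) and fail the pipeline when the returned list is non-empty.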