Context
FinLedge, a B2B payments company, runs 120 production data pipelines that ingest PostgreSQL CDC, S3 batch files, and Kafka events into Snowflake. Today, deployments are manual: Airflow DAGs, dbt models, and Spark jobs are pushed directly to production, causing broken dependencies, schema drift, and inconsistent rollback behavior.
You need to design a CI/CD pipeline for the data platform so engineering teams can safely test, deploy, and monitor changes to ETL/ELT workflows with minimal downtime.
Scale Requirements
- Pipelines: 120 production workflows, growing 10% per quarter
- Deploy frequency: 30-50 merges/day across DAGs, dbt, and Spark code
- Data volume: 8 TB/day batch + 150K events/sec streaming peak
- Latency targets: CI validation < 15 minutes; production deployment < 10 minutes; rollback < 5 minutes
- Environments: dev, staging, and prod, each in a separate AWS account
- Retention: CI artifacts and logs retained for 90 days
Requirements
- Design a CI/CD process for Airflow DAGs, dbt transformations, and Spark jobs using Git-based workflows.
- Validate code quality with unit tests, SQL tests, schema checks, and pipeline dependency validation before merge (see the validation sketch after this list).
- Support environment-specific configuration, secrets management, and promotion from dev to staging to prod (see the configuration sketch after this list).
- Ensure idempotent deployments, safe rollback, and versioned artifacts for DAGs, container images, and dbt manifests (see the deployment/rollback sketch after this list).
- Prevent bad releases from corrupting downstream tables or breaking scheduled jobs.
- Include monitoring for deployment failures, data quality regressions, and runtime health after release (see the health-check sketch after this list).
- Support both batch and streaming jobs without pausing critical financial reporting pipelines.
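A minimal sketch of the pre-merge validation step, assuming pytest-style execution on the CI runner with the Airflow and dbt packages installed; the `dags/` and `dbt_project/` paths and the `ci` dbt target are illustrative, not FinLedge's actual layout:
```python
"""Pre-merge CI checks: DAG import validation and dbt compilation.

Paths, project names, and the 'ci' target are assumptions for illustration.
"""
import subprocess
import sys

from airflow.models import DagBag


def validate_dags(dag_folder: str = "dags/") -> None:
    """Fail the build if any DAG file fails to import."""
    dag_bag = DagBag(dag_folder=dag_folder, include_examples=False)
    if dag_bag.import_errors:
        for path, error in dag_bag.import_errors.items():
            print(f"DAG import error in {path}: {error}", file=sys.stderr)
        sys.exit(1)


def validate_dbt(project_dir: str = "dbt_project/") -> None:
    """Compile the dbt project against a CI target to catch SQL and ref() errors."""
    result = subprocess.run(
        ["dbt", "compile", "--project-dir", project_dir, "--target", "ci"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(result.stdout + result.stderr, file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    validate_dags()
    validate_dbt()
```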
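One possible shape for environment-specific configuration, with promotion from dev to staging to prod driven by selecting a config rather than hand-editing settings; the account IDs, bucket names, and database names below are placeholders:
```python
"""Environment-specific deployment configuration (dev -> staging -> prod).

All values are placeholders, not FinLedge's real accounts, buckets, or databases.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    aws_account_id: str
    dag_bucket: str          # S3 bucket backing the MWAA environment
    snowflake_database: str  # target database for dbt models
    secrets_prefix: str      # AWS Secrets Manager path, resolved at runtime


ENVIRONMENTS = {
    "dev": EnvConfig("111111111111", "finledge-dags-dev", "ANALYTICS_DEV", "finledge/dev/"),
    "staging": EnvConfig("222222222222", "finledge-dags-staging", "ANALYTICS_STG", "finledge/staging/"),
    "prod": EnvConfig("333333333333", "finledge-dags-prod", "ANALYTICS", "finledge/prod/"),
}


def config_for(env: str) -> EnvConfig:
    """Look up the promotion target; unknown names fail fast in CI."""
    try:
        return ENVIRONMENTS[env]
    except KeyError:
        raise ValueError(f"Unknown environment: {env!r}") from None
```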
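A sketch of idempotent, versioned deployment: each release uploads artifacts under an immutable Git-SHA prefix and then flips a small pointer object, so rollback is a pointer change rather than a re-deploy. The bucket name and the `releases/<sha>/` key layout are assumptions for illustration:
```python
"""Versioned artifact deployment with pointer-based rollback.

The bucket name and key layout are assumptions; the idea is that production
only ever reads whichever release prefix the pointer object names.
"""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "finledge-dags-prod"          # placeholder bucket
POINTER_KEY = "releases/current.json"  # small object naming the active release


def publish_release(git_sha: str, local_artifacts: dict[str, str]) -> None:
    """Upload artifacts under an immutable per-release prefix.

    Idempotent: re-running for the same SHA overwrites identical objects.
    """
    for key, path in local_artifacts.items():
        s3.upload_file(path, BUCKET, f"releases/{git_sha}/{key}")


def activate_release(git_sha: str) -> None:
    """Point production at a release; rollback is the same call with a prior SHA."""
    body = json.dumps({"release": git_sha}).encode()
    s3.put_object(Bucket=BUCKET, Key=POINTER_KEY, Body=body)
```
Rollback then reduces to `activate_release("<previous-known-good-sha>")`, which fits comfortably inside the 5-minute rollback target.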
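A sketch of a post-release health check that measures table freshness in Snowflake and publishes a CloudWatch metric an alarm can key on; the connection parameters, `loaded_at` column, table names, and metric namespace are all assumptions:
```python
"""Post-deployment smoke check: Snowflake table freshness -> CloudWatch metric.

Connection parameters, the 'loaded_at' column, and the namespace are assumptions.
"""
import boto3
import snowflake.connector

cloudwatch = boto3.client("cloudwatch")


def check_freshness(conn_params: dict, table: str, max_lag_minutes: int = 60) -> bool:
    """Return True if the table has received rows within the allowed lag."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute(
            f"SELECT DATEDIFF('minute', MAX(loaded_at), CURRENT_TIMESTAMP()) FROM {table}"
        )
        lag = cur.fetchone()[0]
    finally:
        conn.close()

    # Emit the lag so deployment alarms and dashboards can track regressions.
    cloudwatch.put_metric_data(
        Namespace="FinLedge/Deployments",
        MetricData=[{
            "MetricName": "TableLagMinutes",
            "Value": float(lag or 0),
            "Dimensions": [{"Name": "Table", "Value": table}],
        }],
    )
    return lag is not None and lag <= max_lag_minutes
```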
Constraints
- AWS-first stack; existing services include MWAA, ECR, S3, EMR, and Snowflake
- Team has 5 data engineers and 1 platform engineer
- Monthly incremental tooling budget is capped at $15K
- SOX-style auditability is required for production changes
- Production secrets cannot be exposed in CI runners (see the runtime secret-resolution sketch below)
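One common pattern for keeping production secrets off CI runners is to let CI ship only artifacts, while running tasks resolve credentials from AWS Secrets Manager at execution time inside the production account; a minimal sketch, with the secret name and JSON shape as placeholders:
```python
"""Runtime secret resolution so production credentials never reach CI runners.

The secret name and JSON shape are placeholders; only the MWAA/EMR execution
role, not the CI role, would be granted secretsmanager:GetSecretValue here.
"""
import json

import boto3


def get_snowflake_credentials(secret_id: str = "finledge/prod/snowflake") -> dict:
    """Fetch credentials at task runtime inside the production account."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```
This also helps the SOX-style auditability constraint: secret access is logged by CloudTrail against the production execution role rather than scattered across CI jobs.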