Context
Globant's data platform team manages batch and near-real-time pipelines that ingest CRM, product, and operational data into a cloud warehouse used by internal analytics and client delivery teams. Today, pipeline changes are deployed manually from developer laptops, which leads to inconsistent environments, broken DAGs, and slow rollbacks when a release introduces schema or transformation errors.
You need to design a CI/CD approach for Globant's pipeline stack so that code, SQL transformations, orchestration definitions, and infrastructure changes can be validated and promoted safely across dev, staging, and production.
Scale Requirements
- Pipelines: 180 active pipelines, ~35 deployments/week
- Orchestration: 1,200 scheduled DAG/task runs per day
- Data volume: 12 TB/day batch + 40K events/sec streaming peak
- Latency targets: CI validation < 15 minutes; production rollback < 10 minutes
- Environments: dev, staging, prod across AWS
- Availability target: 99.9% successful scheduled runs per month
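The availability target above implies a concrete monthly error budget. A quick sketch of the arithmetic, assuming ~30-day months (the 1,200 runs/day figure comes from the scale table):

```python
# Error budget implied by the 99.9% successful-scheduled-run target.
RUNS_PER_DAY = 1_200
DAYS_PER_MONTH = 30          # assumption: ~30-day month
AVAILABILITY_TARGET = 0.999  # 99.9% of scheduled runs must succeed

monthly_runs = RUNS_PER_DAY * DAYS_PER_MONTH
allowed_failures = monthly_runs * (1 - AVAILABILITY_TARGET)

print(f"Scheduled runs per month: {monthly_runs}")      # 36,000 runs
print(f"Failed runs allowed per month: {allowed_failures:.0f}")  # ~36 runs
```

Roughly one failed run per day is tolerable platform-wide, which is why the design below leans on automated pre-merge validation rather than post-release firefighting.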
Requirements
- Design a CI/CD workflow for ETL/ELT code, Globant Enterprise AI-compatible data services, dbt models, and Apache Airflow DAGs.
- Include automated checks for unit tests, SQL linting, schema compatibility, data quality assertions, and infrastructure-as-code validation.
- Support safe promotion from feature branch to production with approval gates and environment-specific configuration management.
- Ensure deployments are idempotent and can be rolled back quickly if a DAG, Spark job, or dbt model fails after release.
- Define how to test pipelines that depend on external systems such as Kafka topics, S3 buckets, and Snowflake stages.
- Include monitoring for deployment health, pipeline freshness, and post-release data quality regressions.
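As one concrete example of the schema-compatibility check listed above, a CI step can diff a proposed table schema against the currently deployed one and fail on breaking changes (dropped columns, type changes) while allowing additive ones. A minimal sketch in plain Python; the `{column: type}` dicts and the `check_compatibility` helper are illustrative, not part of the existing stack:

```python
def check_compatibility(deployed: dict, proposed: dict) -> list[str]:
    """Return breaking changes between two {column: type} schemas.

    Dropped columns and type changes break downstream consumers;
    new columns are treated as backward-compatible (additive).
    """
    breaking = []
    for col, col_type in deployed.items():
        if col not in proposed:
            breaking.append(f"column dropped: {col}")
        elif proposed[col] != col_type:
            breaking.append(f"type changed: {col} {col_type} -> {proposed[col]}")
    return breaking

deployed = {"id": "NUMBER", "email": "VARCHAR", "created_at": "TIMESTAMP_NTZ"}
proposed = {"id": "NUMBER", "email": "VARCHAR(320)",
            "created_at": "TIMESTAMP_NTZ", "region": "VARCHAR"}

issues = check_compatibility(deployed, proposed)
print(issues)  # the email type change would fail the CI gate; region is additive
```

In CI, a non-empty result would exit non-zero and block promotion; the same diff output doubles as the audit trail entry for the change.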
Constraints
- Existing stack is AWS + Apache Airflow 2.x + dbt + Snowflake + Apache Spark
- Team has 6 data engineers and 1 platform engineer; solution should minimize operational overhead
- Production data cannot be copied freely into lower environments; masked or synthetic test data is required
- All changes must be auditable for enterprise clients and comply with SOC 2 change-management practices
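The constraint on production data in lower environments can be met with deterministic masking, so foreign-key joins between masked tables still line up. A sketch using keyed hashing (HMAC); the column names and the hard-coded key are illustrative only (in practice the key would live in a secrets manager):

```python
import hashlib
import hmac

# Assumption: key is fetched from a secrets manager in real use;
# hard-coded here only to keep the sketch self-contained.
MASKING_KEY = b"dev-only-masking-key"

def mask(value: str) -> str:
    """Deterministically pseudonymize a PII value.

    Same input -> same token, so joins on masked columns still work,
    but the original value is not recoverable without the key.
    """
    return hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

row = {"customer_id": "C-1042", "email": "ana@example.com", "plan": "enterprise"}
masked = {**row, "email": mask(row["email"])}  # mask only the PII column
```

Because the mapping is deterministic per key, the same masking job can run idempotently on every refresh of the staging dataset, and rotating the key invalidates all previously masked copies at once.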