Context
Northbeam Analytics runs batch and near-real-time data pipelines on AWS using Airflow, EMR Serverless, S3, and Snowflake. Today, infrastructure for DAGs, IAM roles, S3 buckets, and compute environments is created manually across dev, staging, and prod, causing drift, failed deployments, and inconsistent access controls.
You need to design how the team should manage pipeline infrastructure as code with Terraform so new data pipelines can be provisioned reproducibly, reviewed through Git, and promoted safely across environments.
Scale Requirements
- Environments: 3 isolated environments (dev, staging, prod)
- Pipelines: 120 Airflow DAGs, 35 batch Spark jobs, 8 streaming jobs
- Deployments: 20-30 Terraform applies per week
- Storage: 1.5 PB in S3 across raw, staging, and curated zones
- Latency target: Infrastructure changes promoted to prod within 30 minutes of approval
- Team size: 10 data engineers, 2 platform engineers
Requirements
- Define a Terraform structure for reusable modules covering S3, IAM, Airflow connections, EMR Serverless applications, and Snowflake objects (a layout sketch follows this list).
- Support environment-specific configuration without duplicating code.
- Design CI/CD for terraform fmt, validate, plan, policy checks, and controlled apply (the IAM sketch after this list shows one way to gate apply).
- Manage remote state, state locking, and secret handling securely (a backend sketch follows this list).
- Prevent destructive changes to production data stores and shared pipeline resources (see the prevent_destroy sketch below).
- Include strategies for drift detection, module versioning, and rollback (tag pinning in the layout sketch below covers versioning and rollback).
- Explain how Terraform changes integrate with pipeline orchestration and deployment workflows.
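A minimal layout sketch for the first two items, assuming a shared terraform-modules repository and a thin root module per environment; the repo URL, module name, and variables are illustrative, not prescribed:

```hcl
# envs/prod/main.tf -- each environment is a thin root module that pins
# module versions and supplies environment-specific values; dev and
# staging call the same modules with their own *.tfvars files.
# (Provider configuration omitted for brevity.)

variable "environment" {
  type = string # "prod" here, set via envs/prod/prod.tfvars
}

variable "raw_retention_days" {
  type    = number
  default = 365
}

module "raw_zone" {
  # Pinning to a git tag versions the module. Rollback is re-pinning to
  # the previous known-good tag and re-applying; drift detection is a
  # scheduled `terraform plan -detailed-exitcode` against each root.
  source = "git::https://github.com/northbeam/terraform-modules.git//s3_data_zone?ref=v1.4.0"

  environment    = var.environment
  zone           = "raw"
  retention_days = var.raw_retention_days
}
```

The same pattern extends to the IAM, Airflow connection, EMR Serverless, and Snowflake modules; only the pinned tags and the tfvars differ between environments.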
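For the controlled-apply stage, one workable split, sketched under the assumption that CI authenticates through a GitHub Actions OIDC provider (the provider variable, repo path, and role names are assumptions): plan runs read-only on every pull request, while apply uses a separate role that is only reachable through the approval-gated prod environment.

```hcl
variable "github_oidc_provider_arn" {
  type = string
}

# Read-only role assumed for `terraform plan` on pull requests.
resource "aws_iam_role" "tf_plan" {
  name = "tf-ci-plan"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.github_oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:northbeam/data-platform-infra:*"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "tf_plan_readonly" {
  role       = aws_iam_role.tf_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# Write-capable role for `terraform apply`. The trust policy matches only
# the protected prod environment, so it cannot be assumed without passing
# the approval gate guarding that environment; the gate's review history
# also provides the change traceability SOC 2 asks for.
resource "aws_iam_role" "tf_apply" {
  name = "tf-ci-apply"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.github_oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:sub" = "repo:northbeam/data-platform-infra:environment:prod"
        }
      }
    }]
  })
}
```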
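Since Terraform Cloud is not approved, remote state can live in an S3 backend with DynamoDB locking, one state file per environment; the bucket, key, and table names below are placeholders:

```hcl
# envs/prod/backend.tf -- a separate state file per environment keeps the
# blast radius of a bad apply contained; the DynamoDB table provides
# locking so two of the 12 engineers cannot apply concurrently.
terraform {
  backend "s3" {
    bucket         = "northbeam-terraform-state" # versioned, SSE-encrypted
    key            = "prod/data-platform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```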
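To block destructive changes to production data stores, prevent_destroy fails any plan that would delete the resource; a CI policy check that rejects delete actions on prod-tagged resources in the plan JSON adds a second, reviewable layer. A sketch with an illustrative bucket name:

```hcl
resource "aws_s3_bucket" "curated" {
  bucket = "northbeam-prod-curated"

  lifecycle {
    # Terraform refuses any plan that would destroy this bucket,
    # including a `terraform destroy` or an attribute change that
    # forces replacement.
    prevent_destroy = true
  }
}
```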
Constraints
- AWS is the primary cloud; Terraform Cloud is not approved.
- Secrets must remain in AWS Secrets Manager and cannot be stored in state in plaintext (see the ARN-only sketch after this list).
- Production changes require approval and audit logs.
- Monthly platform tooling budget is capped at $8K.
- SOC 2 controls require least-privilege IAM and change traceability.
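One way to satisfy the Secrets Manager constraint is to keep secret values out of Terraform entirely: reference only the secret's ARN (metadata, not the value, lands in state) and let the pipeline resolve the value at runtime. A sketch, assuming a Snowflake loader credential already exists under an illustrative name:

```hcl
# Looks up metadata only; the secret value is never read into state.
data "aws_secretsmanager_secret" "snowflake" {
  name = "prod/snowflake/loader"
}

# Illustrative execution role for EMR Serverless jobs.
resource "aws_iam_role" "pipeline_exec" {
  name = "pipeline-exec"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "emr-serverless.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Least-privilege grant: the job resolves this one secret at runtime
# (e.g., from a Spark job or an Airflow connection), so the plaintext
# never passes through Terraform plans, state, or CI logs.
resource "aws_iam_role_policy" "read_snowflake_secret" {
  name = "read-snowflake-secret"
  role = aws_iam_role.pipeline_exec.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "secretsmanager:GetSecretValue"
      Resource = data.aws_secretsmanager_secret.snowflake.arn
    }]
  })
}
```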