Context
The platform team uses reusable Terraform modules to provision Databricks pipeline infrastructure across dev, staging, and prod: Databricks Workflows, Delta Live Tables pipelines, Unity Catalog objects, clusters, and cloud storage. Today, multiple engineers and CI jobs apply changes from both local machines and GitHub Actions, causing state drift, failed concurrent applies, and inconsistent promotion between environments.
You need to design a Terraform state management approach for this shared infrastructure code that supports safe locking, remote backends, environment isolation, and automated delivery for the Databricks pipeline platform.
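One common way to satisfy the locking and remote-backend requirements on AWS is Terraform's S3 backend with DynamoDB-based state locking. A minimal sketch follows; the bucket, table, and key names are placeholders, not part of the brief:

```hcl
terraform {
  backend "s3" {
    # Hypothetical names -- substitute your own resources.
    bucket         = "acme-terraform-state"                    # versioned, SSE-KMS encrypted
    key            = "dev/us-east-1/pipelines/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"                   # provides the mutual-exclusion lock
    encrypt        = true
  }
}
```

Running `terraform plan -lock-timeout=30s` makes lock acquisition fail fast, matching the 30-second detection target, and a confirmed-stale lock can be cleared with `terraform force-unlock <LOCK_ID>` well within the 15-minute recovery window.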
Scale Requirements
- Teams: 25 platform and DevOps engineers across 8 product squads
- Terraform runs: ~300 plan/apply operations per day across all environments
- Modules: 40+ reusable modules, 120+ workspaces/stacks
- Latency target: lock acquisition or failure detection within 30 seconds
- Recovery target: stale lock resolution within 15 minutes
- State size: 10-50 MB per environment, growing with Unity Catalog and job resources
Requirements
- Design a remote backend strategy for Terraform state used by reusable Databricks infrastructure modules.
- Prevent concurrent mutation of shared state during CI/CD and manual operations.
- Isolate state by environment, region, and platform domain while preserving module reuse.
- Support promotion of Databricks pipeline resources from dev to prod through Git-based workflows.
- Define how secrets, provider credentials, and backend configuration are managed securely.
- Describe rollback, drift detection, and disaster recovery for corrupted or orphaned state.
- Include observability for failed plans, lock contention, stale locks, and unauthorized state access.
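The environment/region/domain isolation requirement can be met without forking module code by keeping the backend block empty in the shared modules and supplying a partial backend configuration file per stack. A sketch, assuming a hypothetical `environments/` layout:

```hcl
# environments/dev-us-east-1/backend.hcl (hypothetical file; one per env/region/domain)
bucket         = "acme-terraform-state-dev"
key            = "us-east-1/pipelines/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-state-locks-dev"
encrypt        = true
```

Each stack is then initialized with `terraform init -backend-config=environments/dev-us-east-1/backend.hcl`, so the same module source and Git history promote unchanged from dev to staging to prod while every stack keeps its own state object and lock scope.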
Constraints
- Primary cloud is AWS; Databricks workspaces exist in 3 regions.
- Use Databricks Asset Bundles and Databricks Workflows where appropriate instead of generic orchestration placeholders.
- Compliance requires encryption at rest, audit logs for state access, and least-privilege IAM.
- Budget does not allow introducing a large commercial control plane beyond existing Databricks and AWS services.
- The solution must support both engineer-triggered runs and CI/CD automation.
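The least-privilege and audit constraints can be expressed directly in Terraform. The sketch below (account ID, bucket, and table names are hypothetical) scopes a dev CI role to its own state prefix and the lock table only:

```hcl
# Hypothetical least-privilege policy for a dev CI role: it can read/write
# only the dev state prefix and operate on the shared lock table.
resource "aws_iam_policy" "dev_state_access" {
  name = "terraform-state-dev"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::acme-terraform-state/dev/*"
      },
      {
        Effect    = "Allow"
        Action    = ["s3:ListBucket"]
        Resource  = "arn:aws:s3:::acme-terraform-state"
        Condition = { StringLike = { "s3:prefix" = ["dev/*"] } }
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
        Resource = "arn:aws:dynamodb:us-east-1:111111111111:table/terraform-state-locks"
      }
    ]
  })
}
```

Pairing this with SSE-KMS encryption and versioning on the state bucket, plus CloudTrail S3 data events or server access logging, covers the encryption-at-rest and state-access audit requirements using only existing AWS services.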