Context
The platform team uses reusable Terraform modules to provision Databricks pipeline infrastructure across dev, staging, and prod: Databricks Workflows, Delta Live Tables pipelines, Unity Catalog objects, clusters, and cloud storage. Today, multiple engineers and CI jobs apply changes from both local machines and GitHub Actions, causing state drift, failed concurrent applies, and inconsistent promotion between environments.
You need to design a Terraform state management approach for this shared infrastructure code that supports safe locking, remote backends, environment isolation, and automated delivery for the Databricks pipeline platform.
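One common way to satisfy the locking and remote-backend requirements on AWS is Terraform's S3 backend with DynamoDB-based state locking. A minimal sketch follows; the bucket, table, and key names are placeholders, not part of the brief:

```hcl
terraform {
  backend "s3" {
    # Hypothetical names -- substitute your own resources.
    bucket         = "acme-terraform-state"                    # versioned, SSE-KMS encrypted
    key            = "dev/us-east-1/pipelines/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"                   # provides the mutual-exclusion lock
    encrypt        = true
  }
}
```

Running `terraform plan -lock-timeout=30s` makes lock acquisition fail fast, matching the 30-second detection target, and a confirmed-stale lock can be cleared with `terraform force-unlock <LOCK_ID>` well within the 15-minute recovery window.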
Scale Requirements
- Teams: 25 platform and DevOps engineers across 8 product squads
- Terraform runs: ~300 plan/apply operations per day across all environments
- Modules: 40+ reusable modules, 120+ workspaces/stacks
- Latency target: lock acquisition or failure detection within 30 seconds
- Recovery target: stale lock resolution within 15 minutes
- State size: 10-50 MB per environment, growing with Unity Catalog and job resources
Requirements
- Design a remote backend strategy for Terraform state used by reusable Databricks infrastructure modules.
- Prevent concurrent mutation of shared state during CI/CD and manual operations.
- Isolate state by environment, region, and platform domain while preserving module reuse.
- Support promotion of Databricks pipeline resources from dev to prod through Git-based workflows.
- Define how secrets, provider credentials, and backend configuration are managed securely.
- Describe rollback, drift detection, and disaster recovery for corrupted or orphaned state.
- Include observability for failed plans, lock contention, stale locks, and unauthorized state access.
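The environment/region/domain isolation requirement can be met without forking module code by keeping the backend block empty in the shared modules and supplying a partial backend configuration file per stack. A sketch, assuming a hypothetical `environments/` layout:

```hcl
# environments/dev-us-east-1/backend.hcl (hypothetical file; one per env/region/domain)
bucket         = "acme-terraform-state-dev"
key            = "us-east-1/pipelines/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-state-locks-dev"
encrypt        = true
```

Each stack is then initialized with `terraform init -backend-config=environments/dev-us-east-1/backend.hcl`, so the same module source and Git history promote unchanged from dev to staging to prod while every stack keeps its own state object and lock scope.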
Constraints
- Primary cloud is AWS; Databricks workspaces exist in 3 regions.
- Use Databricks Asset Bundles and Databricks Workflows where appropriate instead of generic orchestration placeholders.
- Compliance requires encryption at rest, audit logs for state access, and least-privilege IAM.
- Budget does not allow introducing a large commercial control plane beyond existing Databricks and AWS services.
- The solution must support both engineer-triggered runs and CI/CD automation.
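The least-privilege and audit constraints can be expressed directly in Terraform. The sketch below (account ID, bucket, and table names are hypothetical) scopes a dev CI role to its own state prefix and the lock table only:

```hcl
# Hypothetical least-privilege policy for a dev CI role: it can read/write
# only the dev state prefix and operate on the shared lock table.
resource "aws_iam_policy" "dev_state_access" {
  name = "terraform-state-dev"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::acme-terraform-state/dev/*"
      },
      {
        Effect    = "Allow"
        Action    = ["s3:ListBucket"]
        Resource  = "arn:aws:s3:::acme-terraform-state"
        Condition = { StringLike = { "s3:prefix" = ["dev/*"] } }
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
        Resource = "arn:aws:dynamodb:us-east-1:111111111111:table/terraform-state-locks"
      }
    ]
  })
}
```

Pairing this with SSE-KMS encryption and versioning on the state bucket, plus CloudTrail S3 data events or server access logging, covers the encryption-at-rest and state-access audit requirements using only existing AWS services.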