Context
Databricks operates regulated customer data pipelines across AWS, Azure, and GCP. Today, ingestion and transformation jobs run under broad cloud IAM roles and shared service principals, creating audit risk, over-permissioned access to storage, and weak isolation between dev, staging, and prod.
You are asked to design a least-privilege access model for Databricks pipelines that run batch and streaming workloads using Delta Live Tables / Lakeflow Declarative Pipelines, Databricks Workflows, Unity Catalog, and cloud-native identities. The design must support secure cross-cloud ingestion, environment isolation, and auditable access to bronze, silver, and gold data products.
Scale Requirements
- Pipelines: 1,200 scheduled batch pipelines and 180 continuous streaming pipelines
- Data volume: 9 PB total in object storage, 45 TB/day new data
- Tenancy: 60 internal platform teams, 300+ service identities, 3 environments per cloud
- Latency: streaming SLA < 3 minutes end-to-end; batch completion by 6 AM local time in each region
- Auditability: all access decisions traceable within 15 minutes
Requirements
- Design identity and access boundaries for Databricks Workspaces, Unity Catalog metastores, catalogs, schemas, tables, volumes, and external locations across AWS, Azure, and GCP.
- Enforce least privilege for pipeline execution identities so each pipeline can read only its declared sources and write only its approved targets (see the grant sketch after this list).
- Support secretless authentication where possible, using AWS IAM roles, Azure managed identities, and GCP service accounts surfaced through Unity Catalog storage credentials or equivalent workload identity patterns (external-location sketch below).
- Define how Databricks Workflows and Lakeflow Declarative Pipelines obtain short-lived credentials for cloud storage and downstream systems (run-as sketch below).
- Include controls for schema evolution, data quality failures, and gated promotion from bronze to silver/gold (expectations sketch below).
- Provide monitoring, alerting, and automated remediation for privilege drift, failed policy enforcement, and unauthorized access attempts (audit-log sketch below).
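
For the least-privilege requirement, the core mechanism is that Unity Catalog is default-deny: a pipeline's service principal can touch only what is explicitly granted. A minimal sketch with hypothetical catalog, schema, and principal names (`sp-pipeline-orders` stands in for the service principal's application ID); `spark` is the ambient SparkSession in a Databricks governance job:

```python
# Hypothetical per-pipeline grants: the pipeline's service principal may
# read only its declared bronze sources and write only its silver target.
read_grants = [
    "GRANT USE CATALOG ON CATALOG prod_bronze TO `sp-pipeline-orders`",
    "GRANT USE SCHEMA ON SCHEMA prod_bronze.sales TO `sp-pipeline-orders`",
    "GRANT SELECT ON TABLE prod_bronze.sales.orders_raw TO `sp-pipeline-orders`",
]
write_grants = [
    "GRANT USE CATALOG ON CATALOG prod_silver TO `sp-pipeline-orders`",
    "GRANT USE SCHEMA, CREATE TABLE, MODIFY ON SCHEMA prod_silver.sales TO `sp-pipeline-orders`",
]
for stmt in read_grants + write_grants:
    spark.sql(stmt)
```

Because privileges are additive, the boundary is enforced by what is not granted; per-environment catalogs (e.g. `dev_bronze` vs `prod_bronze`) then give dev/staging/prod isolation without extra machinery.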
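For secretless authentication, the Databricks-native pattern is a Unity Catalog storage credential backed by a cloud identity (AWS IAM role, Azure managed identity, or GCP service account), scoped to specific paths through external locations. A sketch with hypothetical names, assuming a platform admin has already registered the credential `bronze_ingest_cred`:

```python
# Bind a landing path to an IAM-role-backed credential; no static keys.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_orders
    URL 's3://acme-landing/orders/'
    WITH (STORAGE CREDENTIAL bronze_ingest_cred)
""")

# The pipeline identity gets read-only file access to this location only.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION landing_orders TO `sp-pipeline-orders`")
```

The same two-step pattern (credential, then path-scoped location) applies on Azure (abfss:// URLs) and GCP (gs:// URLs).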
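For short-lived credentials at run time, Workflows can execute under a dedicated service principal via the Jobs API run_as setting; Unity Catalog then vends temporary, down-scoped storage tokens to the job, so neither notebooks nor pipeline configs ever hold keys. A sketch of the relevant fragment of a Jobs API 2.1 settings payload (IDs are hypothetical placeholders):

```python
# Fragment of a Jobs API 2.1 job-settings payload. run_as takes the
# service principal's application ID; the pipeline task references a
# Lakeflow Declarative Pipeline by ID.
job_settings = {
    "name": "orders-silver-nightly",
    "run_as": {"service_principal_name": "<sp-application-id>"},
    "tasks": [
        {
            "task_key": "transform",
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        }
    ],
}
```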
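For data quality and promotion control, expectations in the pipeline definition act as the gate: rows failing a drop expectation never reach silver, and a fail expectation halts the update instead of promoting bad data. A minimal sketch with hypothetical table and column names:

```python
import dlt

@dlt.table(name="orders_silver", comment="Quality-gated promotion from bronze")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_fail("amount_parses", "try_cast(amount AS DECIMAL(18,2)) IS NOT NULL")
def orders_silver():
    # The explicit select list pins the silver schema contract, so
    # upstream schema evolution cannot silently propagate new columns.
    return dlt.read_stream("orders_bronze").select(
        "order_id", "customer_id", "amount", "ingest_ts"
    )
```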
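For the 15-minute auditability target, Unity Catalog audit events land in the `system.access.audit` system table and can be polled by a scheduled job. A sketch that flags denied requests; a companion query over permission-change events would cover privilege drift (the alerting hook is an assumption, not a Databricks API):

```python
def notify_security(event):
    # Hypothetical hook: replace with a webhook into the SIEM / on-call.
    print(f"ALERT: {event.action_name} denied for {event.principal}")

# Denied access attempts in the last 15 minutes.
denied = spark.sql("""
    SELECT event_time,
           user_identity.email AS principal,
           action_name,
           request_params
    FROM system.access.audit
    WHERE event_time > current_timestamp() - INTERVAL 15 MINUTES
      AND response.status_code = 403
""")
for row in denied.collect():
    notify_security(row)
```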
Constraints
- Must use Databricks-native governance first: Unity Catalog, service principals, cluster (compute) policies, and audit logs (cluster policy sketch after this list).
- No long-lived static cloud keys stored in notebooks or pipeline configs.
- Must satisfy SOC 2 and GDPR and support customer-managed VPC/VNet deployment patterns.
- Incremental platform budget increase is capped at $40K/month.
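
On the compute side, a cluster policy can make the governed posture the only one available: pin a Unity Catalog security mode, forbid instance profiles that would bypass catalog-level grants, and bound cluster size against the $40K/month cap. A sketch of a policy definition body (attribute choices are illustrative, not prescriptive):

```python
import json

# Illustrative cluster policy for pipeline compute: forces UC-governed
# single-user mode, blocks direct instance-profile access, and caps
# autoscaling and idle time to help stay within the budget ceiling.
pipeline_policy = {
    "data_security_mode": {"type": "fixed", "value": "SINGLE_USER"},
    "aws_attributes.instance_profile_arn": {"type": "forbidden"},
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
}
print(json.dumps(pipeline_policy, indent=2))  # definition body for a policy create call
```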