Context
The Databricks environment runs multiple Delta Live Tables and Databricks Workflows pipelines that ingest product telemetry, billing events, and CRM data into a Unity Catalog-governed lakehouse. Today, teams can see table-level dependencies in some places, but they cannot reliably trace column-level lineage, assess the downstream impact of a schema change, or identify which jobs produced a given dashboard metric.
You need to design a lineage implementation on Databricks that spans batch and streaming pipelines, exposes lineage to data engineers and governance teams, and supports operational debugging and compliance audits.
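Native capabilities already cover part of this: on current releases, Unity Catalog records table-level and column-level lineage in the system.access.table_lineage and system.access.column_lineage system tables. A minimal sketch of the kind of question that is hard to answer today, assuming those system tables are enabled in the account; the table and column names in the WHERE clause are hypothetical:

```python
# Sketch: one hop of upstream column-level lineage from Unity Catalog system
# tables. Assumes the system.access schema is enabled; the target table and
# column names below are hypothetical. `spark` is the ambient SparkSession
# in a Databricks notebook or job.
upstream = spark.sql("""
    SELECT DISTINCT
        source_table_full_name,
        source_column_name,
        entity_type,                -- JOB, PIPELINE, NOTEBOOK, DBSQL_QUERY, ...
        event_date
    FROM system.access.column_lineage
    WHERE target_table_full_name = 'main.billing.daily_revenue'   -- hypothetical
      AND target_column_name     = 'revenue_usd'                  -- hypothetical
      AND event_date >= date_sub(current_date(), 90)
""")
upstream.show(truncate=False)
```

The gap to close is less capture than durability and reach: system-table lineage is retained only for a platform-defined window, so the 400-day audit requirement likely means copying edges into a store you own (sketched after the Scale Requirements list), and the system tables do not by themselves carry quality signals, replay, or alerting.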
Scale Requirements
- Sources: 120 upstream systems (Kafka, cloud object storage, SaaS connectors, operational databases)
- Pipelines: 1,500 Databricks Workflows jobs and 300 Delta Live Tables / Lakeflow Declarative Pipelines
- Tables: 25,000 Unity Catalog tables/views, including 4,000 streaming tables
- Volume: ~80 TB/day ingested, ~12M lineage edges, ~250K lineage events/day
- Latency target: lineage visible within 5 minutes of a pipeline run or schema change
- Retention: 400 days for audit history
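To make these targets concrete, one plausible native store is a Delta edge table partitioned by event date, anticipating the queryable edge model required below: ~12M edges and ~250K events/day is small by Delta standards, and date partitioning keeps 400 days of history prunable and cheap to expire. Everything here is a sketch; the table, column, and staging-view names are illustrative, not a prescribed schema.

```python
# Sketch: a Delta-native lineage edge store sized for the targets above.
# All table, column, and view names are illustrative. `spark` is the ambient
# SparkSession in a Databricks notebook or job.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.lineage.edges (
        event_id      STRING NOT NULL,  -- unique lineage-event id (dedup key)
        source_entity STRING NOT NULL,  -- e.g. catalog.schema.table or a job/task URN
        target_entity STRING NOT NULL,
        source_column STRING,           -- NULL for table-level edges
        target_column STRING,
        entity_type   STRING,           -- JOB, PIPELINE, NOTEBOOK, DBSQL_QUERY, ...
        run_id        STRING,           -- producing job/pipeline run, if known
        event_time    TIMESTAMP,
        event_date    DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# Idempotent ingestion: MERGE on the event id so replays and backfills of the
# same lineage events land as no-ops instead of duplicate edges.
spark.sql("""
    MERGE INTO main.lineage.edges AS t
    USING staged_lineage_events AS s   -- hypothetical staging view of new events
    ON t.event_id = s.event_id AND t.event_date = s.event_date
    WHEN NOT MATCHED THEN INSERT *
""")

# Retention: expire edges outside the 400-day audit window.
spark.sql("DELETE FROM main.lineage.edges WHERE event_date < date_sub(current_date(), 400)")
```

Keying the MERGE on the event id is what makes reprocessing of historical pipeline metadata safe to repeat; the date predicate in the join condition exists only for partition pruning.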
Requirements
- Capture table-level and column-level lineage for Delta tables, views, materialized views, and streaming tables in Unity Catalog.
- Track lineage across Databricks Workflows, SQL Warehouses, notebooks, Auto Loader, and Delta Live Tables / Lakeflow pipelines.
- Support impact analysis for schema changes, failed upstream datasets, and PII-tagged columns (see the traversal sketch after this list).
- Store lineage history as a queryable graph or edge model for audit and debugging.
- Include data quality signals so consumers can see whether upstream datasets passed expectations.
- Provide monitoring, alerting, and replay/backfill for missed lineage events (a detection sketch follows the Constraints section).
- Design for idempotent ingestion and support reprocessing of historical pipeline metadata.
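For the impact-analysis requirement, an iterative breadth-first traversal over the edge table is usually sufficient at this scale (~12M edges) without a separate graph engine. The sketch below reuses the hypothetical main.lineage.edges table from above; to seed the walk with PII-tagged columns, roots can first be looked up in each catalog's information_schema.column_tags view.

```python
# Sketch: downstream impact analysis as an iterative BFS over the edge table.
# Reuses the hypothetical main.lineage.edges table sketched earlier; `spark`
# is the ambient SparkSession in a Databricks notebook or job.
from pyspark.sql import functions as F

def downstream_of(root: str, max_hops: int = 10):
    """Return all entities reachable downstream of `root` (root itself at hop 0)."""
    edges = spark.table("main.lineage.edges").select("source_entity", "target_entity").distinct()
    frontier = spark.createDataFrame([(root, 0)], ["entity", "hop"])
    seen = frontier

    for hop in range(1, max_hops + 1):
        frontier = (
            frontier.join(edges, frontier.entity == edges.source_entity)
                    .select(edges.target_entity.alias("entity"), F.lit(hop).alias("hop"))
                    .join(seen.select("entity"), "entity", "left_anti")  # skip visited nodes
                    .distinct()
        )
        if frontier.isEmpty():
            break
        seen = seen.unionByName(frontier)
    return seen

# Example: everything affected by a schema change on one (hypothetical) table.
impacted = downstream_of("main.billing.daily_revenue")
impacted.orderBy("hop").show(truncate=False)
```

Bounding max_hops keeps the join loop predictable, and at this edge count each hop is a modest shuffle; that is the usual justification for deferring a dedicated graph database, in line with the budget constraint below.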
Constraints
- Prefer native Databricks capabilities first: Unity Catalog, system tables, Delta, Workflows, Auto Loader, SQL, and Lakeflow.
- Must support AWS, Azure, and GCP workspaces.
- PII metadata must remain inside the Databricks account boundary.
- Incremental infrastructure budget is limited; avoid introducing a separate graph database unless clearly justified.
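The monitoring and replay requirement can also stay native-first: job runs and captured edges both live in queryable tables, so an anti-join surfaces runs that produced no lineage. A sketch, assuming the system.lakeflow.job_run_timeline system table is enabled and that the hypothetical edge table records the producing run_id:

```python
# Sketch: surface recent successful job runs that produced no lineage edges,
# as candidates for replay/backfill. Assumes the system.lakeflow schema is
# enabled; main.lineage.edges is the hypothetical edge table sketched above.
missing = spark.sql("""
    SELECT r.job_id, r.run_id, r.period_end_time
    FROM system.lakeflow.job_run_timeline AS r
    LEFT ANTI JOIN main.lineage.edges AS e
      ON CAST(r.run_id AS STRING) = e.run_id
    WHERE r.period_end_time >= current_timestamp() - INTERVAL 1 DAY
      AND r.result_state = 'SUCCEEDED'   -- completion rows carry a result_state
""")

# Route non-empty results to alerting (for example a Databricks SQL alert or
# a job notification), then re-ingest metadata for the flagged runs.
if not missing.isEmpty():
    missing.show(truncate=False)
```

Because the edge table's MERGE is keyed on event_id, re-ingesting a flagged run's metadata is safe to repeat, which is what makes this detect-and-replay loop idempotent end to end.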