Context
The Databricks environment runs multiple Delta Live Tables and Databricks Workflows pipelines that ingest product telemetry, billing events, and CRM data into a Unity Catalog-governed lakehouse. Today, teams can see table-level dependencies in some places, but they cannot reliably trace column-level lineage, assess the downstream impact of a schema change, or identify which jobs produced a given dashboard metric.
You need to design a lineage implementation on Databricks that spans batch and streaming pipelines, exposes lineage to data engineers and governance teams, and supports operational debugging and compliance audits.
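Native capabilities already cover part of this: on current releases, Unity Catalog records table-level and column-level lineage in the system.access.table_lineage and system.access.column_lineage system tables. A minimal sketch of the kind of question that is hard to answer today, assuming those system tables are enabled in the account; the table and column names in the WHERE clause are hypothetical:

```python
# Sketch: one hop of upstream column-level lineage from Unity Catalog system
# tables. Assumes the system.access schema is enabled; the target table and
# column names below are hypothetical. `spark` is the ambient SparkSession
# in a Databricks notebook or job.
upstream = spark.sql("""
    SELECT DISTINCT
        source_table_full_name,
        source_column_name,
        entity_type,                -- JOB, PIPELINE, NOTEBOOK, DBSQL_QUERY, ...
        event_date
    FROM system.access.column_lineage
    WHERE target_table_full_name = 'main.billing.daily_revenue'   -- hypothetical
      AND target_column_name     = 'revenue_usd'                  -- hypothetical
      AND event_date >= date_sub(current_date(), 90)
""")
upstream.show(truncate=False)
```

The gap to close is less capture than durability and reach: system-table lineage is retained only for a platform-defined window, so the 400-day audit requirement likely means copying edges into a store you own (sketched after the Scale Requirements list), and the system tables do not by themselves carry quality signals, replay, or alerting.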
Scale Requirements
- Sources: 120 upstream systems (Kafka, cloud object storage, SaaS connectors, operational databases)
- Pipelines: 1,500 Databricks Workflows jobs and 300 Delta Live Tables / Lakeflow Declarative Pipelines
- Tables: 25,000 Unity Catalog tables/views, including 4,000 streaming tables
- Volume: ~80 TB/day ingested, ~12M lineage edges, ~250K lineage events/day
- Latency target: lineage visible within 5 minutes of a pipeline run or schema change
- Retention: 400 days for audit history
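To make these targets concrete, one plausible native store is a Delta edge table partitioned by event date, anticipating the queryable edge model required below: ~12M edges and ~250K events/day is small by Delta standards, and date partitioning keeps 400 days of history prunable and cheap to expire. Everything here is a sketch; the table, column, and staging-view names are illustrative, not a prescribed schema.

```python
# Sketch: a Delta-native lineage edge store sized for the targets above.
# All table, column, and view names are illustrative. `spark` is the ambient
# SparkSession in a Databricks notebook or job.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.lineage.edges (
        event_id      STRING NOT NULL,  -- unique lineage-event id (dedup key)
        source_entity STRING NOT NULL,  -- e.g. catalog.schema.table or a job/task URN
        target_entity STRING NOT NULL,
        source_column STRING,           -- NULL for table-level edges
        target_column STRING,
        entity_type   STRING,           -- JOB, PIPELINE, NOTEBOOK, DBSQL_QUERY, ...
        run_id        STRING,           -- producing job/pipeline run, if known
        event_time    TIMESTAMP,
        event_date    DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# Idempotent ingestion: MERGE on the event id so replays and backfills of the
# same lineage events land as no-ops instead of duplicate edges.
spark.sql("""
    MERGE INTO main.lineage.edges AS t
    USING staged_lineage_events AS s   -- hypothetical staging view of new events
    ON t.event_id = s.event_id AND t.event_date = s.event_date
    WHEN NOT MATCHED THEN INSERT *
""")

# Retention: expire edges outside the 400-day audit window.
spark.sql("DELETE FROM main.lineage.edges WHERE event_date < date_sub(current_date(), 400)")
```

Keying the MERGE on the event id is what makes reprocessing of historical pipeline metadata safe to repeat; the date predicate in the join condition exists only for partition pruning.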
Requirements
- Capture table-level and column-level lineage for Delta tables, views, materialized views, and streaming tables in Unity Catalog.
- Track lineage across Databricks Workflows, SQL Warehouses, notebooks, Auto Loader, and Delta Live Tables / Lakeflow pipelines.
- Support impact analysis for schema changes, failed upstream datasets, and PII-tagged columns (see the traversal sketch after this list).
- Store lineage history as a queryable graph or edge model for audit and debugging.
- Include data quality signals so consumers can see whether upstream datasets passed expectations.
- Provide monitoring, alerting, and replay/backfill for missed lineage events (a detection sketch follows the Constraints section).
- Design for idempotent ingestion and support reprocessing of historical pipeline metadata.
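For the impact-analysis requirement, an iterative breadth-first traversal over the edge table is usually sufficient at this scale (~12M edges) without a separate graph engine. The sketch below reuses the hypothetical main.lineage.edges table from above; to seed the walk with PII-tagged columns, roots can first be looked up in each catalog's information_schema.column_tags view.

```python
# Sketch: downstream impact analysis as an iterative BFS over the edge table.
# Reuses the hypothetical main.lineage.edges table sketched earlier; `spark`
# is the ambient SparkSession in a Databricks notebook or job.
from pyspark.sql import functions as F

def downstream_of(root: str, max_hops: int = 10):
    """Return all entities reachable downstream of `root` (root itself at hop 0)."""
    edges = spark.table("main.lineage.edges").select("source_entity", "target_entity").distinct()
    frontier = spark.createDataFrame([(root, 0)], ["entity", "hop"])
    seen = frontier

    for hop in range(1, max_hops + 1):
        frontier = (
            frontier.join(edges, frontier.entity == edges.source_entity)
                    .select(edges.target_entity.alias("entity"), F.lit(hop).alias("hop"))
                    .join(seen.select("entity"), "entity", "left_anti")  # skip visited nodes
                    .distinct()
        )
        if frontier.isEmpty():
            break
        seen = seen.unionByName(frontier)
    return seen

# Example: everything affected by a schema change on one (hypothetical) table.
impacted = downstream_of("main.billing.daily_revenue")
impacted.orderBy("hop").show(truncate=False)
```

Bounding max_hops keeps the join loop predictable, and at this edge count each hop is a modest shuffle; that is the usual justification for deferring a dedicated graph database, in line with the budget constraint below.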
Constraints
- Prefer native Databricks capabilities first: Unity Catalog, system tables, Delta, Workflows, Auto Loader, SQL, and Lakeflow.
- Must support AWS, Azure, and GCP workspaces.
- PII metadata must remain inside the Databricks account boundary.
- Incremental infrastructure budget is limited; avoid introducing a separate graph database unless clearly justified.
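The monitoring and replay requirement can also stay native-first: job runs and captured edges both live in queryable tables, so an anti-join surfaces runs that produced no lineage. A sketch, assuming the system.lakeflow.job_run_timeline system table is enabled and that the hypothetical edge table records the producing run_id:

```python
# Sketch: surface recent successful job runs that produced no lineage edges,
# as candidates for replay/backfill. Assumes the system.lakeflow schema is
# enabled; main.lineage.edges is the hypothetical edge table sketched above.
missing = spark.sql("""
    SELECT r.job_id, r.run_id, r.period_end_time
    FROM system.lakeflow.job_run_timeline AS r
    LEFT ANTI JOIN main.lineage.edges AS e
      ON CAST(r.run_id AS STRING) = e.run_id
    WHERE r.period_end_time >= current_timestamp() - INTERVAL 1 DAY
      AND r.result_state = 'SUCCEEDED'   -- completion rows carry a result_state
""")

# Route non-empty results to alerting (for example a Databricks SQL alert or
# a job notification), then re-ingest metadata for the flagged runs.
if not missing.isEmpty():
    missing.show(truncate=False)
```

Because the edge table's MERGE is keyed on event_id, re-ingesting a flagged run's metadata is safe to repeat, which is what makes this detect-and-replay loop idempotent end to end.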