Context
A Spark-based batch pipeline on Databricks builds daily customer usage tables on Delta Lake for finance and product analytics. A bug in one transformation produced incorrect aggregates, and you need to backfill the last 6 months of data without breaking downstream SLAs or corrupting current production outputs.
The current pipeline uses Databricks Workflows for orchestration, Auto Loader for landing raw files into Bronze, and Delta Live Tables or scheduled Spark jobs to build Silver and Gold tables in Unity Catalog. Design a safe, repeatable backfill strategy.
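For concreteness, here is a minimal sketch of the assumed Bronze ingestion path (catalog, schema, and S3 paths are hypothetical placeholders); the backfill strategy you design should slot in alongside a pipeline of roughly this shape.

```python
# Minimal sketch of the assumed Bronze landing job (all names are placeholders).
# Runs on Databricks, where `spark` is predefined in the notebook/job context.
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")                        # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://raw-usage/_schemas/usage_events")
    .load("s3://raw-usage/events/")
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("source_file", F.col("_metadata.file_path"))     # lineage for audits
)

(bronze_stream.writeStream
    .option("checkpointLocation", "s3://raw-usage/_checkpoints/usage_events")
    .trigger(availableNow=True)                                   # incremental batch-style run
    .toTable("main.bronze.usage_events"))
```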
Scale Requirements
- Historical window: 180 days of data
- Input volume: 2.5 TB/day of raw JSON and Parquet; ~450 TB total historical scan
- Daily records: ~8 billion events/day
- Target latency: complete backfill within 72 hours
- Freshness constraint: ongoing daily production pipeline must remain under 90 minutes end-to-end
- Storage: Bronze retained for 13 months; Silver/Gold stored in Delta Lake with Change Data Feed enabled
Requirements
- Design how you would partition and orchestrate the backfill across 180 days using Databricks Workflows (sketches of one possible shape follow this list).
- Explain how you would isolate backfill writes from the live pipeline and safely merge results into production Delta tables.
- Ensure the process is idempotent so rerunning any failed date range does not create duplicates or inconsistent aggregates.
- Describe how you would handle schema evolution, late-arriving records, and dimension table changes during the backfill.
- Define data quality checks before promoting backfilled data to downstream consumers.
- Include rollback, retry, and observability strategies using Databricks-native capabilities where possible (a Delta history/RESTORE sketch follows the Constraints).
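The sketches below are not a reference solution; they show one shape an answer could take. First, orchestration: a small driver that fans the 180-day window out as parameterized runs of a single Workflows backfill job, with bounded concurrency so the daily production pipeline keeps its 90-minute SLA and spend stays under the budget cap. The job ID, parameter name, and concurrency figure are hypothetical; the native "For each" task type is an in-job alternative to this external loop.

```python
# Sketch: fan the backfill out as one parameterized job run per logical date.
# BACKFILL_JOB_ID, the date window, and the parameter name are hypothetical.
import datetime
from concurrent.futures import ThreadPoolExecutor
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
BACKFILL_JOB_ID = 123456789
START = datetime.date(2024, 1, 1)
DATES = [START + datetime.timedelta(days=i) for i in range(180)]

def run_one_date(d: datetime.date) -> str:
    # Each run rebuilds exactly one date; reruns are safe because the job
    # overwrites only that date's partition (see the next sketch).
    run = w.jobs.run_now(
        job_id=BACKFILL_JOB_ID,
        job_parameters={"backfill_date": d.isoformat()},
    )
    run.result()  # raises if the run fails, so the date surfaces for retry
    return d.isoformat()

# Bounded concurrency: enough parallelism to finish inside 72 hours without
# starving the live pipeline or blowing the cluster budget.
with ThreadPoolExecutor(max_workers=6) as pool:
    for finished in pool.map(run_one_date, DATES):
        print(f"backfilled {finished}")
```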
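Isolation and idempotency (the second and third requirements) can come from rebuilding each date into a staging schema and then promoting it with a partition-scoped overwrite; rerunning any date simply rewrites the same slice. Table names here are hypothetical, and both Gold tables are assumed to exist and be partitioned by event_date. A MERGE keyed on (event_date, customer_id) is an alternative when the grain is not a clean date partition.

```python
# Sketch: rebuild one date into a staging table, then promote that slice only.
from pyspark.sql import functions as F

def rebuild_one_date(backfill_date: str) -> None:
    silver = spark.table("main.silver.usage_events")
    fixed = (
        silver.filter(F.col("event_date") == backfill_date)
        .dropDuplicates(["event_id"])            # guards against duplicate source files
        .groupBy("event_date", "customer_id")
        .agg(F.sum("usage_units").alias("usage_units"),
             F.count("*").alias("event_count"))
    )
    # Staging schema keeps partial results invisible to BI consumers.
    (fixed.write.format("delta").mode("overwrite")
        .option("replaceWhere", f"event_date = '{backfill_date}'")
        .saveAsTable("main.gold_backfill.daily_usage"))

def promote_one_date(backfill_date: str) -> None:
    # Atomic, partition-scoped swap into production; safe to rerun.
    staged = (spark.table("main.gold_backfill.daily_usage")
              .filter(F.col("event_date") == backfill_date))
    (staged.write.format("delta").mode("overwrite")
        .option("replaceWhere", f"event_date = '{backfill_date}'")
        .saveAsTable("main.gold.daily_usage"))
```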
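For schema evolution, late-arriving records, and the promotion gate (the fourth and fifth requirements): schema drift on the Bronze read can be absorbed with Auto Loader schema evolution or a permissive read, and late arrivals are picked up automatically because each date is rebuilt from Bronze rather than appended to. The gate below is a sketch of pre-promotion checks; thresholds and table names are hypothetical and would be agreed with finance.

```python
# Sketch: block promotion of a backfilled date unless it passes basic checks.
from pyspark.sql import functions as F

def validate_one_date(backfill_date: str) -> None:
    staged = (spark.table("main.gold_backfill.daily_usage")
              .filter(F.col("event_date") == backfill_date))
    current = (spark.table("main.gold.daily_usage")
               .filter(F.col("event_date") == backfill_date))

    staged_stats = staged.agg(
        F.count("*").alias("rows"),
        F.countDistinct("customer_id").alias("customers"),
        F.sum("usage_units").alias("usage_units"),
    ).first()
    current_rows = current.count()

    # Hard failures: structural problems that must block promotion.
    assert staged_stats["rows"] > 0, f"{backfill_date}: staged slice is empty"
    dupes = staged.groupBy("customer_id").count().filter("count > 1").count()
    assert dupes == 0, f"{backfill_date}: duplicate customer rows in staged slice"

    # Soft check: corrected aggregates should not move row counts by more than
    # an agreed tolerance versus what production currently holds.
    if current_rows > 0:
        drift = abs(staged_stats["rows"] - current_rows) / current_rows
        assert drift < 0.05, f"{backfill_date}: row count moved {drift:.1%}, needs sign-off"
```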
Constraints
- Use Databricks on AWS with Unity Catalog-managed Delta tables.
- No full downtime is acceptable for downstream BI consumers.
- Cluster budget is capped at $18K for the backfill run.
- Finance reports require reproducible numbers; all changes must be auditable.
- You may assume source data in cloud object storage is complete, but some partitions contain duplicate files and exhibit occasional schema drift.
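For the rollback, observability, and auditability points above, Delta table history and RESTORE are the Databricks-native primitives: every promotion is a versioned commit, so finance can pin reports to a signed-off version and a bad promotion can be reverted per table. The version numbers and table name in this sketch are hypothetical.

```python
# Sketch: audit, reproduce, and roll back promotions via Delta table history.

# Audit trail: who wrote what, when, and with which operation parameters.
history = spark.sql("DESCRIBE HISTORY main.gold.daily_usage")
history.select("version", "timestamp", "operation", "operationParameters").show(20, False)

# Reproducibility: pin finance reports to the version recorded at sign-off.
report_df = spark.sql("SELECT * FROM main.gold.daily_usage VERSION AS OF 412")

# Rollback: restore the table to its pre-backfill version if an issue surfaces later.
spark.sql("RESTORE TABLE main.gold.daily_usage TO VERSION AS OF 405")
```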