Context
A Spark-based batch pipeline on Databricks builds daily customer usage tables on Delta Lake for finance and product analytics. A bug in one transformation produced incorrect aggregates, and you need to backfill the last 6 months of data without breaking downstream SLAs or corrupting current production outputs.
The current pipeline uses Databricks Workflows for orchestration, Auto Loader for landing raw files into Bronze, and Delta Live Tables or scheduled Spark jobs to build Silver and Gold tables in Unity Catalog. Design a safe, repeatable backfill strategy.
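For concreteness, here is a minimal sketch of the assumed Bronze ingestion path (catalog, schema, and S3 paths are hypothetical placeholders); the backfill strategy you design should slot in alongside a pipeline of roughly this shape.

```python
# Minimal sketch of the assumed Bronze landing job (all names are placeholders).
# Runs on Databricks, where `spark` is predefined in the notebook/job context.
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")                        # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://raw-usage/_schemas/usage_events")
    .load("s3://raw-usage/events/")
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("source_file", F.col("_metadata.file_path"))     # lineage for audits
)

(bronze_stream.writeStream
    .option("checkpointLocation", "s3://raw-usage/_checkpoints/usage_events")
    .trigger(availableNow=True)                                   # incremental batch-style run
    .toTable("main.bronze.usage_events"))
```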
Scale Requirements
- Historical window: 180 days of data
- Input volume: 2.5 TB/day of raw JSON and Parquet; ~450 TB total historical scan
- Daily records: ~8 billion events/day
- Target latency: complete backfill within 72 hours
- Freshness constraint: ongoing daily production pipeline must remain under 90 minutes end-to-end
- Storage: Bronze retained for 13 months; Silver/Gold stored in Delta Lake with Change Data Feed enabled
Requirements
- Design how you would partition and orchestrate the backfill across 180 days using Databricks Workflows (sketches of one possible shape follow this list).
- Explain how you would isolate backfill writes from the live pipeline and safely merge results into production Delta tables.
- Ensure the process is idempotent so rerunning any failed date range does not create duplicates or inconsistent aggregates.
- Describe how you would handle schema evolution, late-arriving records, and dimension table changes during the backfill.
- Define data quality checks before promoting backfilled data to downstream consumers.
- Include rollback, retry, and observability strategies using Databricks-native capabilities where possible (a Delta history/RESTORE sketch follows the Constraints).
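The sketches below are not a reference solution; they show one shape an answer could take. First, orchestration: a small driver that fans the 180-day window out as parameterized runs of a single Workflows backfill job, with bounded concurrency so the daily production pipeline keeps its 90-minute SLA and spend stays under the budget cap. The job ID, parameter name, and concurrency figure are hypothetical; the native "For each" task type is an in-job alternative to this external loop.

```python
# Sketch: fan the backfill out as one parameterized job run per logical date.
# BACKFILL_JOB_ID, the date window, and the parameter name are hypothetical.
import datetime
from concurrent.futures import ThreadPoolExecutor
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
BACKFILL_JOB_ID = 123456789
START = datetime.date(2024, 1, 1)
DATES = [START + datetime.timedelta(days=i) for i in range(180)]

def run_one_date(d: datetime.date) -> str:
    # Each run rebuilds exactly one date; reruns are safe because the job
    # overwrites only that date's partition (see the next sketch).
    run = w.jobs.run_now(
        job_id=BACKFILL_JOB_ID,
        job_parameters={"backfill_date": d.isoformat()},
    )
    run.result()  # raises if the run fails, so the date surfaces for retry
    return d.isoformat()

# Bounded concurrency: enough parallelism to finish inside 72 hours without
# starving the live pipeline or blowing the cluster budget.
with ThreadPoolExecutor(max_workers=6) as pool:
    for finished in pool.map(run_one_date, DATES):
        print(f"backfilled {finished}")
```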
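Isolation and idempotency (the second and third requirements) can come from rebuilding each date into a staging schema and then promoting it with a partition-scoped overwrite; rerunning any date simply rewrites the same slice. Table names here are hypothetical, and both Gold tables are assumed to exist and be partitioned by event_date. A MERGE keyed on (event_date, customer_id) is an alternative when the grain is not a clean date partition.

```python
# Sketch: rebuild one date into a staging table, then promote that slice only.
from pyspark.sql import functions as F

def rebuild_one_date(backfill_date: str) -> None:
    silver = spark.table("main.silver.usage_events")
    fixed = (
        silver.filter(F.col("event_date") == backfill_date)
        .dropDuplicates(["event_id"])            # guards against duplicate source files
        .groupBy("event_date", "customer_id")
        .agg(F.sum("usage_units").alias("usage_units"),
             F.count("*").alias("event_count"))
    )
    # Staging schema keeps partial results invisible to BI consumers.
    (fixed.write.format("delta").mode("overwrite")
        .option("replaceWhere", f"event_date = '{backfill_date}'")
        .saveAsTable("main.gold_backfill.daily_usage"))

def promote_one_date(backfill_date: str) -> None:
    # Atomic, partition-scoped swap into production; safe to rerun.
    staged = (spark.table("main.gold_backfill.daily_usage")
              .filter(F.col("event_date") == backfill_date))
    (staged.write.format("delta").mode("overwrite")
        .option("replaceWhere", f"event_date = '{backfill_date}'")
        .saveAsTable("main.gold.daily_usage"))
```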
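For schema evolution, late-arriving records, and the promotion gate (the fourth and fifth requirements): schema drift on the Bronze read can be absorbed with Auto Loader schema evolution or a permissive read, and late arrivals are picked up automatically because each date is rebuilt from Bronze rather than appended to. The gate below is a sketch of pre-promotion checks; thresholds and table names are hypothetical and would be agreed with finance.

```python
# Sketch: block promotion of a backfilled date unless it passes basic checks.
from pyspark.sql import functions as F

def validate_one_date(backfill_date: str) -> None:
    staged = (spark.table("main.gold_backfill.daily_usage")
              .filter(F.col("event_date") == backfill_date))
    current = (spark.table("main.gold.daily_usage")
               .filter(F.col("event_date") == backfill_date))

    staged_stats = staged.agg(
        F.count("*").alias("rows"),
        F.countDistinct("customer_id").alias("customers"),
        F.sum("usage_units").alias("usage_units"),
    ).first()
    current_rows = current.count()

    # Hard failures: structural problems that must block promotion.
    assert staged_stats["rows"] > 0, f"{backfill_date}: staged slice is empty"
    dupes = staged.groupBy("customer_id").count().filter("count > 1").count()
    assert dupes == 0, f"{backfill_date}: duplicate customer rows in staged slice"

    # Soft check: corrected aggregates should not move row counts by more than
    # an agreed tolerance versus what production currently holds.
    if current_rows > 0:
        drift = abs(staged_stats["rows"] - current_rows) / current_rows
        assert drift < 0.05, f"{backfill_date}: row count moved {drift:.1%}, needs sign-off"
```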
Constraints
- Use Databricks on AWS with Unity Catalog-managed Delta tables.
- No full downtime is acceptable for downstream BI consumers.
- Cluster budget is capped at $18K for the backfill run.
- Finance reports require reproducible numbers; all changes must be auditable.
- You may assume source data in cloud object storage is complete, but some partitions contain duplicate files and exhibit occasional schema drift.
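For the rollback, observability, and auditability points above, Delta table history and RESTORE are the Databricks-native primitives: every promotion is a versioned commit, so finance can pin reports to a signed-off version and a bad promotion can be reverted per table. The version numbers and table name in this sketch are hypothetical.

```python
# Sketch: audit, reproduce, and roll back promotions via Delta table history.

# Audit trail: who wrote what, when, and with which operation parameters.
history = spark.sql("DESCRIBE HISTORY main.gold.daily_usage")
history.select("version", "timestamp", "operation", "operationParameters").show(20, False)

# Reproducibility: pin finance reports to the version recorded at sign-off.
report_df = spark.sql("SELECT * FROM main.gold.daily_usage VERSION AS OF 412")

# Rollback: restore the table to its pre-backfill version if an issue surfaces later.
spark.sql("RESTORE TABLE main.gold.daily_usage TO VERSION AS OF 405")
```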