Context
FinSight, a B2B SaaS analytics company, discovered a reporting bug in its revenue pipeline: refunds were excluded from net revenue for 14 months due to an incorrect transformation in a dbt model. The current stack uses Kafka for event ingestion, Airflow for orchestration, Spark on Databricks for batch processing, and Snowflake for serving BI dashboards. Leadership needs corrected historical reports without breaking downstream consumers or double-counting data.
Scale Requirements
- Historical window: 14 months of data
- Volume: 3.2B transaction events, ~28 TB raw Parquet in S3
- Daily ingest: 8M new events/day, which continues during the backfill
- Latency target: corrected tables available within 48 hours
- Serving SLA: dashboard freshness for current-day data must remain < 30 minutes
- Retention: raw immutable data stored for 2 years
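A quick back-of-envelope calculation helps put these figures in proportion. The sketch below is illustrative only: it assumes decimal terabytes, and it assumes the whole 48-hour window is available for scanning raw data, neither of which the figures above state. The function name is hypothetical.

```python
# Back-of-envelope sizing from the scale figures above.
# Assumptions (not stated in the requirements): 1 TB = 1e12 bytes, and the
# full 48-hour window is spent reading raw data.

TOTAL_EVENTS = 3.2e9      # historical transaction events
RAW_BYTES = 28e12         # ~28 TB of raw Parquet
DAILY_EVENTS = 8e6        # live ingest that continues during the backfill
DEADLINE_HOURS = 48


def backfill_sizing():
    """Return rough throughput numbers implied by the scale requirements."""
    deadline_s = DEADLINE_HOURS * 3600
    avg_event_bytes = RAW_BYTES / TOTAL_EVENTS
    scan_mb_per_s = RAW_BYTES / deadline_s / 1e6
    backfill_events_per_s = TOTAL_EVENTS / deadline_s
    live_events_per_s = DAILY_EVENTS / 86_400
    return {
        "avg_event_bytes": avg_event_bytes,
        "scan_mb_per_s": scan_mb_per_s,
        "backfill_events_per_s": backfill_events_per_s,
        "live_events_per_s": live_events_per_s,
        # How many times faster than live ingest the backfill must run.
        "backfill_to_live_ratio": backfill_events_per_s / live_events_per_s,
    }


sizing = backfill_sizing()
for name, value in sizing.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the backfill has to sustain roughly 162 MB/s of aggregate reads and process events about 200 times faster than the live stream, which is why it is usually sized separately from the daily load.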
Requirements
- Design a backfill strategy that recomputes historical aggregates after the bug fix while keeping daily production loads running.
- Ensure idempotent reprocessing so reruns do not create duplicates or inconsistent metrics.
- Define how to isolate corrected data before cutover, validate it against current production, and promote it safely.
- Partition reprocessing by event_date and account_id to limit blast radius and enable parallel execution.
- Include data quality checks for row-count reconciliation, refund-rate validation, and aggregate diffs versus source-of-truth finance exports.
- Describe orchestration, checkpointing, retry behavior, and rollback if the backfill introduces regressions.
- Explain how downstream BI models and consumers will be notified of metric restatements.
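One possible shape for the idempotency, checkpointing, retry, and validation requirements above is sketched below. It is a minimal in-memory illustration, not a prescribed design: the dictionaries stand in for an isolated staging table and a durable checkpoint store, and every name (`backfill_partition`, `read_partition`, `diff_against_reference`, the event fields, the sample date) is hypothetical. The key idea is that each event_date partition is overwritten rather than appended to, so a rerun cannot double-count.

```python
from collections import defaultdict


def net_revenue_by_account(events):
    """Corrected aggregate: refunds are subtracted from sales (integer cents)."""
    totals = defaultdict(int)
    for e in events:
        sign = -1 if e["kind"] == "refund" else 1
        totals[e["account_id"]] += sign * e["amount_cents"]
    return dict(totals)


def backfill_partition(event_date, read_partition, staging, checkpoint, max_retries=3):
    """Recompute one event_date partition into staging, at most once per checkpoint."""
    if event_date in checkpoint:
        return "skipped"  # already completed in an earlier run
    for attempt in range(1, max_retries + 1):
        try:
            events = read_partition(event_date)
            break
        except OSError:
            if attempt == max_retries:
                raise  # surface the failure to the orchestrator
    # Overwrite the whole partition instead of appending, so reruns converge
    # to the same result.
    staging[event_date] = net_revenue_by_account(events)
    checkpoint.add(event_date)
    return "done"


def diff_against_reference(staging, reference):
    """Return {(event_date, account_id): (staged, expected)} for every mismatch."""
    mismatches = {}
    for event_date in sorted(set(staging) | set(reference)):
        got = staging.get(event_date, {})
        want = reference.get(event_date, {})
        for account_id in sorted(set(got) | set(want)):
            pair = (got.get(account_id, 0), want.get(account_id, 0))
            if pair[0] != pair[1]:
                mismatches[(event_date, account_id)] = pair
    return mismatches


# Made-up sample data for illustration only.
raw = {
    "2024-01-01": [
        {"account_id": "a1", "kind": "sale", "amount_cents": 10_000},
        {"account_id": "a1", "kind": "refund", "amount_cents": 2_500},
    ],
}
staging, checkpoint = {}, set()
first = backfill_partition("2024-01-01", raw.__getitem__, staging, checkpoint)
second = backfill_partition("2024-01-01", raw.__getitem__, staging, checkpoint)
print(first, second, staging)
```

Promotion to production would happen only once `diff_against_reference` returns an empty result for the partitions being cut over (or every remaining mismatch has been explained), and rollback amounts to leaving the production tables untouched.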
Constraints
- AWS-based infrastructure must be reused; no major platform migration.
- Incremental cloud spend for the backfill should stay under $15K.
- Finance requires an auditable record of what changed, when, and why.
- PII cannot be copied into ad hoc environments; all processing must remain in approved S3/Snowflake accounts.
- The team has 3 data engineers, so the solution should minimize operational complexity.
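To make the audit constraint concrete, the sketch below shows one way a restatement could be recorded so that what changed, when, and why are captured together. The record layout, field names, and the `record_restatement` helper are all hypothetical; a real implementation would write to an append-only table inside the approved Snowflake account rather than to a Python list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RestatementRecord:
    """One audited change to a reported metric (hypothetical layout)."""
    metric: str
    event_date: str
    account_id: str
    old_value_cents: int
    new_value_cents: int
    reason: str
    changed_at: str  # ISO 8601 timestamp in UTC

    @property
    def delta_cents(self):
        return self.new_value_cents - self.old_value_cents


def record_restatement(log, *, metric, event_date, account_id,
                       old_value_cents, new_value_cents, reason, now=None):
    """Append a JSON audit line to `log` and return the record."""
    now = now or datetime.now(timezone.utc)
    rec = RestatementRecord(
        metric=metric,
        event_date=event_date,
        account_id=account_id,
        old_value_cents=old_value_cents,
        new_value_cents=new_value_cents,
        reason=reason,
        changed_at=now.isoformat(),
    )
    # Entries are only ever appended, never edited in place.
    log.append(json.dumps(asdict(rec), sort_keys=True))
    return rec


# Made-up values for illustration only.
audit_log = []
rec = record_restatement(
    audit_log,
    metric="net_revenue",
    event_date="2024-01-01",
    account_id="a1",
    old_value_cents=10_000,
    new_value_cents=7_500,
    reason="refunds excluded by incorrect dbt transformation",
    now=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
print(rec.delta_cents, audit_log[0])
```

The same records could also feed the restatement notices sent to downstream BI consumers, since each one already carries the old value, the new value, and the reason.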