Context
FinSight, a fintech analytics company, runs daily ETL pipelines on AWS using Apache Airflow, Spark, S3, and Snowflake to build transaction, balance, and customer reporting tables. A bug in a transformation job caused 45 days of downstream aggregates to be incorrect, and the team needs a repeatable backfill process that can recompute historical data without corrupting current production tables.
Your task is to design a backfilling strategy and supporting pipeline architecture. Explain what backfilling means in this context and how you would safely reprocess historical data for a defined date range while preserving data quality, lineage, and operational stability.
Scale Requirements
- Input volume: 2.5 TB/day of raw Parquet files in S3
- Historical range: Up to 180 days of backfill per request
- Daily records: ~900 million transaction events/day
- Latency target: Backfill results available within 8 hours for a 45-day reprocessing job
- Freshness requirement: Daily production pipeline must continue with <30 minute delay
- Retention: Raw data stored for 2 years; curated warehouse tables retained indefinitely
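As a quick sanity check, the latency target above implies roughly 14 TB/hour of sustained input throughput for a 45-day run. A minimal back-of-envelope sketch, using only the numbers stated in the scale requirements (variable names are illustrative):

```python
# Back-of-envelope throughput check for the 45-day backfill,
# using the figures from the scale requirements above.

DAILY_RAW_TB = 2.5      # raw Parquet volume per day
BACKFILL_DAYS = 45      # days being reprocessed
LATENCY_HOURS = 8       # wall-clock budget for the backfill

total_tb = DAILY_RAW_TB * BACKFILL_DAYS          # total input to rescan
required_tb_per_hour = total_tb / LATENCY_HOURS  # sustained read rate needed

print(f"Total input: {total_tb} TB")
print(f"Required read throughput: {required_tb_per_hour:.1f} TB/hour")
```

Any proposed cluster sizing should be checked against this rate, with headroom for transformation and Snowflake load time.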
Requirements
- Define a backfill mechanism for reprocessing a bounded historical date range.
- Ensure reruns are idempotent and do not create duplicate records in Snowflake.
- Separate backfill workloads from scheduled daily production runs.
- Support partition-level reprocessing by business_date.
- Include validation checks comparing backfilled outputs to source counts and prior snapshots.
- Provide rollback and recovery steps if a backfill introduces bad data.
- Describe orchestration, monitoring, and failure handling for long-running backfill jobs.
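One common way to satisfy the idempotency, partition-level, and validation requirements together is a delete-then-insert transaction per business_date, paired with a count comparison before the partition is accepted. The sketch below only generates the SQL; table and staging names (`fct_transactions`, the `_backfill_stage` suffix) are hypothetical, not part of the spec:

```python
from datetime import date, timedelta

def backfill_statements(table: str, start: date, end: date) -> list[str]:
    """Emit one delete+insert transaction per business_date partition.

    Deleting the partition before re-inserting makes a rerun idempotent:
    running the same day twice leaves exactly one copy of its rows, and
    partitions outside [start, end] are never touched.
    """
    stmts = []
    d = start
    while d <= end:
        ds = d.isoformat()
        stmts.append(
            "BEGIN;\n"
            f"DELETE FROM {table} WHERE business_date = '{ds}';\n"
            f"INSERT INTO {table} "
            f"SELECT * FROM {table}_backfill_stage WHERE business_date = '{ds}';\n"
            "COMMIT;"
        )
        d += timedelta(days=1)
    return stmts

def validation_sql(table: str, ds: str) -> str:
    """Row-count comparison between the backfilled partition and its staged
    source; a mismatch should fail the run before results are published."""
    return (
        f"SELECT (SELECT COUNT(*) FROM {table} "
        f"WHERE business_date = '{ds}') AS target_rows, "
        f"(SELECT COUNT(*) FROM {table}_backfill_stage "
        f"WHERE business_date = '{ds}') AS source_rows;"
    )
```

Scoping each transaction to a single business_date also gives a natural rollback unit: a bad day can be re-deleted and re-staged without touching its neighbors.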
Constraints
- Existing stack must remain AWS + Snowflake.
- Team size is 3 data engineers; solution should avoid excessive operational complexity.
- Budget allows temporary compute scaling during backfills, but no permanent 24/7 clusters.
- Financial reporting tables require auditability and reproducible reruns.
- Backfills must not overwrite unaffected partitions or block analyst access to current tables.
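Given the no-permanent-clusters and non-blocking constraints, a backfill is usually driven by a separate parameterized DAG that chunks the date range into small batches, so compute can be scaled up per batch and a failure only retries one batch. A minimal planner sketch under those assumptions (batch size is an illustrative choice, not from the spec):

```python
from datetime import date, timedelta

def plan_batches(start: date, end: date, batch_days: int = 5) -> list[list[str]]:
    """Split a backfill date range into fixed-size batches of business_dates.

    Small batches let temporary compute be provisioned per batch and torn
    down afterward, keep the daily production pipeline unblocked, and limit
    the blast radius of a failed run to one batch instead of the full range.
    """
    days = []
    d = start
    while d <= end:
        days.append(d.isoformat())
        d += timedelta(days=1)
    return [days[i:i + batch_days] for i in range(0, len(days), batch_days)]
```

For the 45-day incident in the context above, a 5-day batch size yields 9 sequential batches, each a natural checkpoint for monitoring and retry.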