Context
NovaShop, a mobile commerce platform with 30M monthly active users, discovered that a client-side tracking bug dropped product_view, add_to_cart, and checkout_start events for 9 hours during a major feature launch. The current stack uses mobile/web SDKs feeding Kafka, Spark Structured Streaming for normalization, S3 as the raw lake, and Snowflake for analytics. Product and finance teams need historical metrics corrected without double-counting or corrupting downstream fact tables.
You need to design a backfill pipeline that reconstructs the missing events from available source systems such as API logs, order service logs, and client retry buffers, then safely reprocesses them into analytics tables.
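As one illustration of the reconstruction step, the sketch below rebuilds dropped client events from server-side API access logs. The log format, URL paths, and field names are assumptions for illustration only, not NovaShop's actual schema:

```python
import re
from typing import Optional

# Hypothetical access-log line format; the real API log layout will differ.
LOG_PATTERN = re.compile(
    r'(?P<ts>\S+) (?P<user_id>\S+) "(?P<method>GET|POST) (?P<path>\S+)"'
)

# Illustrative mapping from API paths seen in server logs back to the
# client events the broken SDK failed to emit.
PATH_TO_EVENT = {
    "/api/v2/products": "product_view",
    "/api/v2/cart/items": "add_to_cart",
    "/api/v2/checkout": "checkout_start",
}

def reconstruct_event(log_line: str) -> Optional[dict]:
    """Rebuild one client event from one API log line; None if irrelevant."""
    m = LOG_PATTERN.match(log_line)
    if not m:
        return None
    base_path = m.group("path").split("?")[0]
    event_type = PATH_TO_EVENT.get(base_path)
    if event_type is None:
        return None
    return {
        "event_type": event_type,
        "user_id": m.group("user_id"),
        "event_ts": m.group("ts"),
        "source": "api_log_replay",  # tags the record as reconstructed
    }
```

Tagging each record with a replay source makes it possible to audit, roll back, or re-run reconstructed rows independently of live ingestion.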
Scale Requirements
- Affected window: 9 hours of missing events
- Peak traffic: 220K events/sec at peak, 70K events/sec average
- Backfill volume: up to 2.8B reconstructed events, ~5.5 TB compressed
- Recovery SLA: backfill completed within 18 hours
- Serving SLA: corrected warehouse tables queryable within 30 minutes of each backfill batch completing
- Retention: raw replay artifacts retained for 180 days
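A quick sanity check on the figures above (all numbers come from this section; the only assumption is treating the recovery SLA as wall-clock processing time):

```python
# Back-of-envelope check that the 18-hour recovery SLA is feasible.
BACKFILL_EVENTS = 2.8e9          # reconstructed events (upper bound)
RECOVERY_SLA_HOURS = 18
PEAK_LIVE_RATE = 220_000         # events/sec the stack already sustains

required_rate = BACKFILL_EVENTS / (RECOVERY_SLA_HOURS * 3600)
# ~43.2K events/sec sustained -- roughly 20% of the proven peak rate,
# so the backfill can run alongside live ingestion with headroom.
print(f"required backfill rate: {required_rate:,.0f} events/sec")
print(f"fraction of peak: {required_rate / PEAK_LIVE_RATE:.0%}")
```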
Requirements
- Design a replay/backfill pipeline that reconstructs missing events from authoritative upstream logs.
- Ensure idempotent writes so replayed events do not duplicate existing records in S3 or Snowflake.
- Support mixed processing: historical batch backfill plus ongoing real-time ingestion with no downtime.
- Define validation checks to compare reconstructed counts against unaffected control periods and downstream business metrics.
- Update derived models such as sessions, funnels, and attribution tables after raw event correction.
- Provide observability for replay progress, data quality, and warehouse reconciliation.
- Describe rollback and re-run strategy if a bad reconstruction rule is deployed.
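One common way to satisfy the idempotency requirement is a deterministic event ID derived from stable fields, so re-running any batch yields identical IDs and a downstream dedup or Snowflake MERGE keyed on that ID drops duplicates. A minimal sketch, with illustrative field names rather than NovaShop's actual schema:

```python
import hashlib
import json

def replay_event_id(event: dict) -> str:
    """Deterministic ID from stable fields, so replaying the same source
    record always produces the same ID. Field names are assumptions."""
    key_fields = {
        "user_id": event["user_id"],
        "event_type": event["event_type"],
        "event_ts": event["event_ts"],
        # Source-log coordinates distinguish genuine repeats (same user,
        # same second) from replays of the same underlying record.
        "source_file": event.get("source_file"),
        "source_offset": event.get("source_offset"),
    }
    canonical = json.dumps(key_fields, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

In the warehouse this pairs with an upsert (e.g. `MERGE ... ON target.event_id = source.event_id`) so both the initial backfill and any re-run converge to the same row set.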
Constraints
- AWS is the required cloud; existing services are MSK, EMR, S3, Airflow, Snowflake, and dbt.
- Incremental budget is capped at $18K for the recovery effort.
- Event schemas evolved during the launch; some fields are nullable in historical logs.
- GDPR deletion requests must remain honored during replay and in corrected downstream tables.
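To honor the GDPR constraint during replay, reconstructed events can be anti-joined against the platform's deletion-request records before they re-enter S3 or Snowflake. A minimal sketch, where `deleted_user_ids` stands in for an assumed GDPR tombstone table:

```python
def apply_gdpr_tombstones(events, deleted_user_ids):
    """Drop reconstructed events for users with honored deletion
    requests, so replay does not resurrect erased data. The tombstone
    source here is a plain set; in practice it would be a lookup
    against the deletion-request store (assumed to exist)."""
    tombstones = set(deleted_user_ids)
    return [e for e in events if e["user_id"] not in tombstones]
```

The same filter should also run as a post-backfill check on corrected downstream tables, since derived models (sessions, funnels, attribution) are rebuilt after the raw correction.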