Context
You’re on the Data Platform team at PayWave, a large fintech that powers card-not-present payments for ~120K merchants. PayWave processes customer checkout and risk events from web/mobile SDKs and backend services to drive real-time fraud scoring, merchant dashboards, and revenue reporting. The business is sensitive to data correctness: duplicate “payment_authorized” or “chargeback_created” events can inflate GMV, distort fraud model features, and trigger incorrect merchant alerts.
Last night, a deploy introduced a bug in one of the logging libraries used by the Checkout API. For 4 hours, the service emitted duplicate events (same logical event, repeated 2–5 times) into the event stream. The bug is now fixed at the source, but the duplicates have already landed in your lake/warehouse and have been consumed by downstream pipelines. Leadership asks you to clean up the duplicates without downtime: dashboards must remain available, fraud scoring must keep running, and you must be able to explain and audit what changed.
Current architecture (simplified)
- Producers (Checkout API, Risk service, Web SDK) publish JSON events to Kafka.
- Spark Structured Streaming consumes Kafka and writes to S3 in Parquet (raw bronze).
- Snowpipe loads raw Parquet into Snowflake (raw.events).
- dbt builds curated tables (analytics.payments, analytics.fraud_features) incrementally every 15 minutes.
- Airflow orchestrates dbt runs, backfills, and data quality checks.
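For orientation, a minimal sketch of the bronze ingestion step described above; broker, topic, bucket, and trigger settings are placeholders, not PayWave's actual configuration:

```python
# Sketch: Spark Structured Streaming job reading JSON events from Kafka and
# landing them as Parquet in the S3 bronze zone. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.appName("bronze-events").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder brokers
    .option("subscribe", "checkout.events")           # placeholder topic
    .load()
)

bronze = raw.select(
    col("value").cast("string").alias("json_payload"),  # raw event JSON, parsed downstream
    current_timestamp().alias("ingest_ts"),             # set by the streaming job
)

query = (
    bronze.writeStream.format("parquet")
    .option("path", "s3a://paywave-bronze/events/")                           # placeholder bucket
    .option("checkpointLocation", "s3a://paywave-bronze/_checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)
```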
Scale requirements
- Peak ingestion: 250K events/sec (normal), up to 400K events/sec during campaigns.
- Event size: ~1.2 KB average JSON.
- Daily volume: ~8–12 TB raw Parquet; ~4–6 TB in Snowflake raw.
- SLA: curated tables queryable within 10 minutes of event time.
- Incident window: 4 hours, duplication factor 2–5x, affecting ~30% of traffic.
- Retention: raw Kafka 3 days, S3 bronze 180 days, Snowflake raw 90 days.
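Back-of-envelope sizing of the incident, assuming the affected hours ran near the normal 250K events/sec rate (an assumption; actual counts come from the impact analysis):

$$
4\,\text{h} \times 3{,}600\,\tfrac{\text{s}}{\text{h}} \times 250{,}000\,\tfrac{\text{events}}{\text{s}} \approx 3.6\times 10^{9}\ \text{events in the window},\qquad 30\% \approx 1.1\times 10^{9}\ \text{logical events affected}
$$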
Data characteristics
Event schema (core fields)
| Field | Type | Notes |
|---|---|---|
| event_id | string | UUID generated by the producer (not reliable during the incident; duplicates may share an event_id or carry different ones) |
| event_type | string | e.g., payment_authorized, payment_captured, refund_issued |
| merchant_id | string | high cardinality |
| payment_id | string | stable business key for payment lifecycle |
| amount_cents | int | |
| currency | string | |
| event_ts | timestamp | producer timestamp |
| ingest_ts | timestamp | set by streaming job |
| payload | variant/json | nested metadata |
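The same core fields expressed as a PySpark schema, for use in the sketches below (the nested payload structure is not specified above, so it is kept as a raw string here):

```python
# Core event schema as a PySpark StructType (payload kept as raw JSON text).
from pyspark.sql.types import (
    LongType, StringType, StructField, StructType, TimestampType,
)

EVENT_SCHEMA = StructType([
    StructField("event_id", StringType()),      # producer UUID, unreliable during the incident
    StructField("event_type", StringType()),    # payment_authorized, payment_captured, refund_issued, ...
    StructField("merchant_id", StringType()),
    StructField("payment_id", StringType()),    # stable business key for the payment lifecycle
    StructField("amount_cents", LongType()),
    StructField("currency", StringType()),
    StructField("event_ts", TimestampType()),   # producer timestamp
    StructField("ingest_ts", TimestampType()),  # set by the streaming job
    StructField("payload", StringType()),       # nested metadata, left unparsed in this sketch
])
```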
Quality issues introduced by the bug
- Duplicates are not guaranteed to have identical event_id.
- Duplicates are logically identical by a composite of business fields (e.g., payment_id, event_type, amount_cents, event_ts), but may differ in ingest_ts and minor payload fields.
- Events can arrive late (up to 2 hours) due to mobile offline buffering.
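To make the composite-key idea concrete, a minimal batch dedup sketch over bronze for the incident window; the path and window bounds are placeholders, and the choice to keep the earliest-ingested copy is an assumption to validate:

```python
# Sketch: collapse duplicates in the incident window to one row per logical event,
# keyed on business fields rather than event_id. Paths/timestamps are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("incident-dedup").getOrCreate()

events = (
    spark.read.parquet("s3a://paywave-bronze/events/")        # placeholder path
    .where(col("event_ts").between("2024-05-01 01:00:00",     # placeholder incident
                                   "2024-05-01 05:00:00"))    # window bounds
)

business_key = ["payment_id", "event_type", "amount_cents", "event_ts"]

# Keep the earliest-ingested copy; tie-break on event_id so reruns are deterministic.
w = Window.partitionBy(*business_key).orderBy(col("ingest_ts").asc(), col("event_id").asc())

deduped = (
    events.withColumn("rn", row_number().over(w))
    .where(col("rn") == 1)
    .drop("rn")
)
```

Writing the dropped rows (with their business key and the kept row's event_id) to an audit table is one way to keep the kept/removed decision reproducible and auditable.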
Your task
Design a zero-downtime remediation plan that:
- Stops further duplication impact (prevents new duplicates from being materialized) while keeping streaming online (see the streaming dedup sketch after this list).
- Identifies the affected time window precisely and quantifies impact (counts by event_type/merchant, downstream table deltas).
- Deduplicates raw and curated datasets for the 4-hour window, without breaking consumers or requiring a full reload.
- Ensures idempotent reprocessing (safe to retry) and handles late-arriving data that overlaps the incident window.
- Provides auditability: what was removed/kept, and reproducible logic.
- Maintains or improves data quality guarantees going forward (e.g., exactly-once semantics at the curated layer).
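As an illustration for the first item above, one option consistent with the existing stack is watermarked deduplication inside the Spark job, reusing raw and EVENT_SCHEMA from the earlier sketches. The 2-hour watermark mirrors the stated late-arrival bound; whether the resulting state size is acceptable at 250–400K events/sec is an assumption to verify:

```python
# Sketch: stream-side dedup on the composite business key. Including event_ts
# (the watermark column) in the key lets Spark expire old dedup state.
from pyspark.sql.functions import col, from_json

parsed = (
    raw.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("e"))
    .select("e.*")
)

deduped_stream = (
    parsed
    .withWatermark("event_ts", "2 hours")   # matches the documented late-arrival bound
    .dropDuplicates(["payment_id", "event_type", "amount_cents", "event_ts"])
)
```

This only limits how many duplicates reach bronze going forward; the curated layer still needs merge or insert-overwrite style loads (e.g., incremental models keyed on the same composite) to stay idempotent across retries and late-arriving data.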
Non-functional requirements
- No downtime for merchant dashboards and fraud scoring (reads must continue).
- Minimal backfill cost: avoid reprocessing more than necessary; target <20% increase in daily compute.
- Operational safety: changes must be reversible (rollback plan).
- Compliance: changes must be auditable (SOX-like controls for revenue reporting).
Constraints
- You must keep Kafka topics and schemas stable for existing consumers.
- Snowflake is the system of record for analytics; S3 bronze is the system of record for raw.
- Team skill set: strong in Spark, dbt, Snowflake, Airflow; limited Flink expertise.
- You cannot pause Snowpipe for more than 5 minutes.
What we’re evaluating
- Your approach to deduplication when there is no perfect unique id.
- How you design incremental, backfillable pipelines.
- How you prevent recurrence via idempotency, constraints, and quality checks.
- How you execute a remediation with safe cutover, monitoring, and rollback.
Deliverables in your answer:
- A proposed end-to-end remediation plan (including cutover strategy).
- Deduplication keys/logic and how you handle edge cases (late arrivals, partial payload differences).
- How you would update Spark/dbt models to be robust to duplicates and safe to re-run.
- Monitoring/alerting and an operational runbook outline.
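As one illustration of the kind of duplicate-rate check such monitoring could rely on (the curated table, key concatenation, threshold, and scheduling are assumptions, not prescribed here):

```python
# Sketch: hourly duplicate-rate probe against the curated payments table,
# runnable from any DB-API cursor (e.g., an Airflow task using the Snowflake connector).
DUPLICATE_RATE_SQL = """
    SELECT
        DATE_TRUNC('hour', event_ts) AS hour,
        COUNT(*) AS total_rows,
        COUNT(DISTINCT payment_id || '|' || event_type || '|' || amount_cents || '|' || event_ts)
            AS logical_events
    FROM analytics.payments
    WHERE event_ts >= DATEADD('hour', -6, CURRENT_TIMESTAMP())
    GROUP BY 1
"""

def duplicate_rate_breached(cursor, max_duplicate_ratio=0.001):
    """Return True if any recent hour exceeds the allowed duplicate ratio."""
    cursor.execute(DUPLICATE_RATE_SQL)
    for _hour, total_rows, logical_events in cursor.fetchall():
        duplicates = total_rows - logical_events
        if total_rows and duplicates / total_rows > max_duplicate_ratio:
            return True
    return False
```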