Context
You’re on the Data Platform team at PayWave, a large fintech that powers card-not-present payments for ~120K merchants. PayWave processes customer checkout and risk events from web/mobile SDKs and backend services to drive real-time fraud scoring, merchant dashboards, and revenue reporting. The business is sensitive to data correctness: duplicate “payment_authorized” or “chargeback_created” events can inflate GMV, distort fraud model features, and trigger incorrect merchant alerts.
Last night, a deploy introduced a bug in one of the logging libraries used by the Checkout API. For 4 hours, the service emitted duplicate events (same logical event, repeated 2–5 times) into the event stream. The bug is now fixed at the source, but the duplicates have already landed in your lake/warehouse and have been consumed by downstream pipelines. Leadership asks you to clean up the duplicates without downtime: dashboards must remain available, fraud scoring must keep running, and you must be able to explain and audit what changed.
Current architecture (simplified)
- Producers (Checkout API, Risk service, Web SDK) publish JSON events to Kafka.
- Spark Structured Streaming consumes Kafka and writes to S3 in Parquet (raw bronze).
- Snowpipe loads raw Parquet into Snowflake (raw.events).
- dbt builds curated tables (analytics.payments, analytics.fraud_features) incrementally every 15 minutes.
- Airflow orchestrates dbt runs, backfills, and data quality checks.
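For orientation, a minimal sketch of the bronze ingestion step described above; broker, topic, bucket, and trigger settings are placeholders, not PayWave's actual configuration:

```python
# Sketch: Spark Structured Streaming job reading JSON events from Kafka and
# landing them as Parquet in the S3 bronze zone. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.appName("bronze-events").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder brokers
    .option("subscribe", "checkout.events")           # placeholder topic
    .load()
)

bronze = raw.select(
    col("value").cast("string").alias("json_payload"),  # raw event JSON, parsed downstream
    current_timestamp().alias("ingest_ts"),             # set by the streaming job
)

query = (
    bronze.writeStream.format("parquet")
    .option("path", "s3a://paywave-bronze/events/")                           # placeholder bucket
    .option("checkpointLocation", "s3a://paywave-bronze/_checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)
```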
Scale requirements
- Peak ingestion: 250K events/sec (normal), up to 400K events/sec during campaigns.
- Event size: ~1.2 KB average JSON.
- Daily volume: ~8–12 TB raw Parquet; ~4–6 TB in Snowflake raw.
- SLA: curated tables queryable within 10 minutes of event time.
- Incident window: 4 hours, duplication factor 2–5x, affecting ~30% of traffic.
- Retention: raw Kafka 3 days, S3 bronze 180 days, Snowflake raw 90 days.
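Back-of-envelope sizing of the incident, assuming the affected hours ran near the normal 250K events/sec rate (an assumption; actual counts come from the impact analysis):

$$
4\,\text{h} \times 3{,}600\,\tfrac{\text{s}}{\text{h}} \times 250{,}000\,\tfrac{\text{events}}{\text{s}} \approx 3.6\times 10^{9}\ \text{events in the window},\qquad 30\% \approx 1.1\times 10^{9}\ \text{logical events affected}
$$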
Data characteristics
Event schema (core fields)
| Field | Type | Notes |
|---|---|---|
| event_id | string | UUID generated by the producer (not reliable during the incident; duplicates may share an event_id or carry different ones) |
| event_type | string | e.g., payment_authorized, payment_captured, refund_issued |
| merchant_id | string | high cardinality |
| payment_id | string | stable business key for payment lifecycle |
| amount_cents | int | |
| currency | string | |
| event_ts | timestamp | producer timestamp |
| ingest_ts | timestamp | set by streaming job |
| payload | variant/json | nested metadata |
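The same core fields expressed as a PySpark schema, for use in the sketches below (the nested payload structure is not specified above, so it is kept as a raw string here):

```python
# Core event schema as a PySpark StructType (payload kept as raw JSON text).
from pyspark.sql.types import (
    LongType, StringType, StructField, StructType, TimestampType,
)

EVENT_SCHEMA = StructType([
    StructField("event_id", StringType()),      # producer UUID, unreliable during the incident
    StructField("event_type", StringType()),    # payment_authorized, payment_captured, refund_issued, ...
    StructField("merchant_id", StringType()),
    StructField("payment_id", StringType()),    # stable business key for the payment lifecycle
    StructField("amount_cents", LongType()),
    StructField("currency", StringType()),
    StructField("event_ts", TimestampType()),   # producer timestamp
    StructField("ingest_ts", TimestampType()),  # set by the streaming job
    StructField("payload", StringType()),       # nested metadata, left unparsed in this sketch
])
```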
Quality issues introduced by the bug
- Duplicates are not guaranteed to have identical event_id.
- Duplicates are logically identical by a composite of business fields (e.g., payment_id, event_type, amount_cents, event_ts), but may differ in ingest_ts and minor payload fields.
- Events can arrive late (up to 2 hours) due to mobile offline buffering.
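To make the composite-key idea concrete, a minimal batch dedup sketch over bronze for the incident window; the path and window bounds are placeholders, and the choice to keep the earliest-ingested copy is an assumption to validate:

```python
# Sketch: collapse duplicates in the incident window to one row per logical event,
# keyed on business fields rather than event_id. Paths/timestamps are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("incident-dedup").getOrCreate()

events = (
    spark.read.parquet("s3a://paywave-bronze/events/")        # placeholder path
    .where(col("event_ts").between("2024-05-01 01:00:00",     # placeholder incident
                                   "2024-05-01 05:00:00"))    # window bounds
)

business_key = ["payment_id", "event_type", "amount_cents", "event_ts"]

# Keep the earliest-ingested copy; tie-break on event_id so reruns are deterministic.
w = Window.partitionBy(*business_key).orderBy(col("ingest_ts").asc(), col("event_id").asc())

deduped = (
    events.withColumn("rn", row_number().over(w))
    .where(col("rn") == 1)
    .drop("rn")
)
```

Writing the dropped rows (with their business key and the kept row's event_id) to an audit table is one way to keep the kept/removed decision reproducible and auditable.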
Your task
Design a zero-downtime remediation plan that:
- Stops further duplication impact (prevents new duplicates from being materialized) while keeping streaming online (see the streaming dedup sketch after this list).
- Identifies the affected time window precisely and quantifies impact (counts by event_type/merchant, downstream table deltas).
- Deduplicates raw and curated datasets for the 4-hour window, without breaking consumers or requiring a full reload.
- Ensures idempotent reprocessing (safe to retry) and handles late-arriving data that overlaps the incident window.
- Provides auditability: what was removed/kept, and reproducible logic.
- Maintains or improves data quality guarantees going forward (e.g., exactly-once semantics at the curated layer).
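As an illustration for the first item above, one option consistent with the existing stack is watermarked deduplication inside the Spark job, reusing raw and EVENT_SCHEMA from the earlier sketches. The 2-hour watermark mirrors the stated late-arrival bound; whether the resulting state size is acceptable at 250–400K events/sec is an assumption to verify:

```python
# Sketch: stream-side dedup on the composite business key. Including event_ts
# (the watermark column) in the key lets Spark expire old dedup state.
from pyspark.sql.functions import col, from_json

parsed = (
    raw.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("e"))
    .select("e.*")
)

deduped_stream = (
    parsed
    .withWatermark("event_ts", "2 hours")   # matches the documented late-arrival bound
    .dropDuplicates(["payment_id", "event_type", "amount_cents", "event_ts"])
)
```

This only limits how many duplicates reach bronze going forward; the curated layer still needs merge or insert-overwrite style loads (e.g., incremental models keyed on the same composite) to stay idempotent across retries and late-arriving data.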
Non-functional requirements
- No downtime for merchant dashboards and fraud scoring (reads must continue).
- Minimal backfill cost: avoid reprocessing more than necessary; target <20% increase in daily compute.
- Operational safety: changes must be reversible (rollback plan).
- Compliance: changes must be auditable (SOX-like controls for revenue reporting).
Constraints
- You must keep Kafka topics and schemas stable for existing consumers.
- Snowflake is the system of record for analytics; S3 bronze is the system of record for raw.
- Team skill set: strong in Spark, dbt, Snowflake, Airflow; limited Flink expertise.
- You cannot pause Snowpipe for more than 5 minutes.
What we’re evaluating
- Your approach to deduplication when there is no perfect unique id.
- How you design incremental, backfillable pipelines.
- How you prevent recurrence via idempotency, constraints, and quality checks.
- How you execute a remediation with safe cutover, monitoring, and rollback.
Deliverables in your answer:
- A proposed end-to-end remediation plan (including cutover strategy).
- Deduplication keys/logic and how you handle edge cases (late arrivals, partial payload differences).
- How you would update Spark/dbt models to be robust to duplicates and safe to re-run.
- Monitoring/alerting and an operational runbook outline.
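As one illustration of the kind of duplicate-rate check such monitoring could rely on (the curated table, key concatenation, threshold, and scheduling are assumptions, not prescribed here):

```python
# Sketch: hourly duplicate-rate probe against the curated payments table,
# runnable from any DB-API cursor (e.g., an Airflow task using the Snowflake connector).
DUPLICATE_RATE_SQL = """
    SELECT
        DATE_TRUNC('hour', event_ts) AS hour,
        COUNT(*) AS total_rows,
        COUNT(DISTINCT payment_id || '|' || event_type || '|' || amount_cents || '|' || event_ts)
            AS logical_events
    FROM analytics.payments
    WHERE event_ts >= DATEADD('hour', -6, CURRENT_TIMESTAMP())
    GROUP BY 1
"""

def duplicate_rate_breached(cursor, max_duplicate_ratio=0.001):
    """Return True if any recent hour exceeds the allowed duplicate ratio."""
    cursor.execute(DUPLICATE_RATE_SQL)
    for _hour, total_rows, logical_events in cursor.fetchall():
        duplicates = total_rows - logical_events
        if total_rows and duplicates / total_rows > max_duplicate_ratio:
            return True
    return False
```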