You’re on the data platform team at a global fintech processor that ingests 2–5 billion card events per day into a PostgreSQL-compatible warehouse powering fraud detection, chargeback workflows, and regulatory reporting. A bug in an upstream Kafka consumer caused duplicate writes into the card_authorizations fact table for ~36 hours. The table is append-heavy, queried continuously by risk models and dashboards, and is also used to generate audit artifacts.
Because this table is business-critical, you cannot take downtime, cannot block reads for long, and you must avoid massive long-running transactions that bloat WAL/undo, saturate I/O, or trigger replication lag. You need a plan that both (a) identifies duplicates reliably and (b) removes them safely while the system stays online.
How would you identify and remove duplicate records from a table with billions of rows without downtime?
In your answer, cover the following:
- How you define a duplicate (e.g., a natural key such as merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
- Which detection techniques you would use (ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
- Which window functions apply (ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you'd use them in a staged process; a sketch of such a process follows this list.
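As a rough illustration of the staged approach, here is a minimal PostgreSQL sketch. The schema is assumed, not taken from the scenario: an `id` surrogate key, `merchant_id`, `network_auth_id`, `auth_payload`, and `event_time` columns on card_authorizations, a hypothetical `card_auth_dup_candidates` staging table, and placeholder incident-window timestamps.

```sql
-- Stage 1: scoped detection. Scan only the ~36-hour incident window and record
-- every row that is NOT the first occurrence of its natural key in a staging table.
-- All table and column names below are illustrative assumptions.
CREATE TABLE IF NOT EXISTS card_auth_dup_candidates (
    id              bigint PRIMARY KEY,   -- surrogate key of the duplicate row
    keep_id         bigint NOT NULL,      -- surrogate key of the row being kept (audit trail)
    payload_matches boolean NOT NULL,     -- false => near-duplicate, route to manual review
    detected_at     timestamptz NOT NULL DEFAULT now()
);

INSERT INTO card_auth_dup_candidates (id, keep_id, payload_matches)
SELECT d.id, d.keep_id, d.payload_hash = d.keep_payload_hash
FROM (
    SELECT id,
           md5(auth_payload::text)                     AS payload_hash,
           FIRST_VALUE(id) OVER w                      AS keep_id,
           FIRST_VALUE(md5(auth_payload::text)) OVER w AS keep_payload_hash,
           ROW_NUMBER() OVER w                         AS rn
    FROM card_authorizations
    WHERE event_time >= '2024-06-01 00:00+00'   -- start of incident window (placeholder)
      AND event_time <  '2024-06-02 12:00+00'   -- end of incident window (placeholder)
    WINDOW w AS (PARTITION BY merchant_id, network_auth_id ORDER BY event_time, id)
) d
WHERE d.rn > 1
ON CONFLICT (id) DO NOTHING;

-- Stage 2: batched deletes. Each statement removes a small slice of confirmed
-- exact duplicates, so no single transaction holds locks or generates WAL for long.
-- Run it repeatedly (with pauses) until it deletes zero rows; near-duplicates
-- stay in the staging table for manual review.
WITH batch AS (
    DELETE FROM card_auth_dup_candidates c
    WHERE c.id IN (
        SELECT id
        FROM card_auth_dup_candidates
        WHERE payload_matches
        ORDER BY id
        LIMIT 5000
        FOR UPDATE SKIP LOCKED
    )
    RETURNING c.id
)
DELETE FROM card_authorizations ca
USING batch
WHERE ca.id = batch.id;
```

In practice you would drive Stage 2 from a script that sleeps between batches and watches replication lag (e.g., pg_stat_replication on vanilla PostgreSQL), archive the staging rows rather than delete them if they feed audit artifacts, VACUUM/ANALYZE afterwards, and close the loop by making the consumer idempotent (for example, via a unique constraint or upsert on the natural key) so the duplicates cannot recur.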