Interview Guides

Zero-Downtime Deduplication at Scale | Dataford Interview Questions - Dataford - Ace your Interview

Zero-Downtime Deduplication at Scale

Hard

SQL & Data Manipulation

JoinsAggregationsCTEsSubqueriesData Wrangling

Problem

Context

You’re on the data platform team at a global fintech processor that ingests 2–5 billion card events per day into a PostgreSQL-compatible warehouse powering fraud detection, chargeback workflows, and regulatory reporting. A bug in an upstream Kafka consumer caused duplicate writes into the card_authorizations fact table for ~36 hours. The table is append-heavy, queried continuously by risk models and dashboards, and is also used to generate audit artifacts.

Because this table is business-critical, you cannot take downtime, cannot block reads for long, and you must avoid massive long-running transactions that bloat WAL/undo, saturate I/O, or trigger replication lag. You need a plan that both (a) identifies duplicates reliably and (b) removes them safely while the system stays online.

Core Question

How would you identify and remove duplicate records from a table with billions of rows without downtime?

In your answer, cover the following:

Definition of “duplicate”: what key(s) define uniqueness (e.g., merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
Detection approach: how you would find duplicates efficiently (e.g., window functions like ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
Removal approach without downtime: describe at least one safe strategy such as:
- creating a new deduplicated table and performing a backfill + dual-write cutover, or
- deleting in small batches using a stable ordering key, or
- using partition swaps (where supported).
Constraints & correctness: how you ensure consistent results while new rows arrive (isolation level, watermarking, idempotent writes, reconciliation queries).
Performance & operational safety: indexes required, batching size, avoiding table locks, monitoring (replication lag, dead tuples, vacuum), and rollback plan.

Scope Guidance (What a strong answer includes)

Concrete SQL patterns (e.g., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you’d use them in a staged process.
Trade-offs between “rebuild + swap” vs “in-place batched deletes” and when each is appropriate.
A brief discussion of enforcing future uniqueness (unique constraints, partial indexes, idempotency keys, or upstream exactly-once semantics).

Key Concepts

Defining a deterministic survivor row

Deduplication requires a deterministic rule for which row to keep (e.g., earliest ingestion, latest update, highest-quality payload). This prevents flapping results across runs and enables idempotent cleanup jobs.

SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY merchant_id, network_auth_id
         ORDER BY ingested_at ASC, auth_event_id ASC
       ) AS rn
FROM card_authorizations;

Incremental scoping (time/partition/range) to avoid full-table scans

On billion-row tables, you rarely scan everything. You constrain work to affected partitions (e.g., the 36-hour window), or to ranges on a clustered key, and process incrementally with watermarks to control runtime and I/O.

WHERE ingested_at >= TIMESTAMP '2026-01-10 00:00:00'
  AND ingested_at <  TIMESTAMP '2026-01-11 12:00:00'

Zero-downtime rebuild-and-swap (backfill + dual-write)

A common no-downtime pattern is to build a new table with the correct uniqueness, backfill deduped data, then dual-write new events to both tables until caught up. Finally, you cut over via a view rename or atomic swap, minimizing lock time.

CREATE TABLE card_authorizations_v2 (LIKE card_authorizations INCLUDING ALL);
-- Add a uniqueness guarantee in v2
CREATE UNIQUE INDEX CONCURRENTLY ux_auth_key
ON card_authorizations_v2 (merchant_id, network_auth_id);

Batched deletes using a stable key to limit lock time

If rebuilding is too expensive, you can delete duplicates in small batches to avoid long transactions and heavy locking. The batch query should use an indexable predicate and delete by primary key to keep each batch fast and predictable.

WITH d AS (
  SELECT auth_event_id
  FROM (
    SELECT auth_event_id,
           ROW_NUMBER() OVER (PARTITION BY merchant_id, network_auth_id ORDER BY ingested_at, auth_event_id) rn
    FROM card_authorizations
    WHERE ingested_at >= :start_ts AND ingested_at < :end_ts
  ) x
  WHERE rn > 1
  ORDER BY auth_event_id
  LIMIT 50000
)
DELETE FROM card_authorizations
WHERE auth_event_id IN (SELECT auth_event_id FROM d);

Preventing recurrence with idempotency and constraints

Cleanup is incomplete if duplicates can reappear. You typically add an idempotency key (or natural unique key) and enforce it with a unique index/constraint, plus upstream retry semantics that use INSERT…ON CONFLICT (or MERGE) to make writes idempotent.

INSERT INTO card_authorizations (merchant_id, network_auth_id, amount_cents, ingested_at)
VALUES (...)
ON CONFLICT (merchant_id, network_auth_id)
DO UPDATE SET amount_cents = EXCLUDED.amount_cents,
              ingested_at = LEAST(card_authorizations.ingested_at, EXCLUDED.ingested_at);

Problem

Context

Core Question

How would you identify and remove duplicate records from a table with billions of rows without downtime?

In your answer, cover the following:

Definition of “duplicate”: what key(s) define uniqueness (e.g., merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
Detection approach: how you would find duplicates efficiently (e.g., window functions like ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
Removal approach without downtime: describe at least one safe strategy such as:
- creating a new deduplicated table and performing a backfill + dual-write cutover, or
- deleting in small batches using a stable ordering key, or
- using partition swaps (where supported).
Constraints & correctness: how you ensure consistent results while new rows arrive (isolation level, watermarking, idempotent writes, reconciliation queries).
Performance & operational safety: indexes required, batching size, avoiding table locks, monitoring (replication lag, dead tuples, vacuum), and rollback plan.

Scope Guidance (What a strong answer includes)

Concrete SQL patterns (e.g., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you’d use them in a staged process.
Trade-offs between “rebuild + swap” vs “in-place batched deletes” and when each is appropriate.
A brief discussion of enforcing future uniqueness (unique constraints, partial indexes, idempotency keys, or upstream exactly-once semantics).

Key Concepts

Defining a deterministic survivor row

SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY merchant_id, network_auth_id
         ORDER BY ingested_at ASC, auth_event_id ASC
       ) AS rn
FROM card_authorizations;

Incremental scoping (time/partition/range) to avoid full-table scans

WHERE ingested_at >= TIMESTAMP '2026-01-10 00:00:00'
  AND ingested_at <  TIMESTAMP '2026-01-11 12:00:00'

Zero-downtime rebuild-and-swap (backfill + dual-write)

CREATE TABLE card_authorizations_v2 (LIKE card_authorizations INCLUDING ALL);
-- Add a uniqueness guarantee in v2
CREATE UNIQUE INDEX CONCURRENTLY ux_auth_key
ON card_authorizations_v2 (merchant_id, network_auth_id);

Batched deletes using a stable key to limit lock time

WITH d AS (
  SELECT auth_event_id
  FROM (
    SELECT auth_event_id,
           ROW_NUMBER() OVER (PARTITION BY merchant_id, network_auth_id ORDER BY ingested_at, auth_event_id) rn
    FROM card_authorizations
    WHERE ingested_at >= :start_ts AND ingested_at < :end_ts
  ) x
  WHERE rn > 1
  ORDER BY auth_event_id
  LIMIT 50000
)
DELETE FROM card_authorizations
WHERE auth_event_id IN (SELECT auth_event_id FROM d);

Preventing recurrence with idempotency and constraints

INSERT INTO card_authorizations (merchant_id, network_auth_id, amount_cents, ingested_at)
VALUES (...)
ON CONFLICT (merchant_id, network_auth_id)
DO UPDATE SET amount_cents = EXCLUDED.amount_cents,
              ingested_at = LEAST(card_authorizations.ingested_at, EXCLUDED.ingested_at);

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Debug and Optimize Fraud DeduplicationMedium RDeduplicate Out-of-Order Replication EventsMedium Offline Sync Queue With DedupMedium

Next question

Zero-Downtime Deduplication at Scale

Hard

SQL & Data Manipulation

JoinsAggregationsCTEsSubqueriesData Wrangling

Problem

Context

Core Question

How would you identify and remove duplicate records from a table with billions of rows without downtime?

In your answer, cover the following:

Definition of “duplicate”: what key(s) define uniqueness (e.g., merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
Detection approach: how you would find duplicates efficiently (e.g., window functions like ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
Removal approach without downtime: describe at least one safe strategy such as:
- creating a new deduplicated table and performing a backfill + dual-write cutover, or
- deleting in small batches using a stable ordering key, or
- using partition swaps (where supported).
Constraints & correctness: how you ensure consistent results while new rows arrive (isolation level, watermarking, idempotent writes, reconciliation queries).
Performance & operational safety: indexes required, batching size, avoiding table locks, monitoring (replication lag, dead tuples, vacuum), and rollback plan.

Scope Guidance (What a strong answer includes)

Concrete SQL patterns (e.g., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you’d use them in a staged process.
Trade-offs between “rebuild + swap” vs “in-place batched deletes” and when each is appropriate.
A brief discussion of enforcing future uniqueness (unique constraints, partial indexes, idempotency keys, or upstream exactly-once semantics).

Key Concepts

Defining a deterministic survivor row

SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY merchant_id, network_auth_id
         ORDER BY ingested_at ASC, auth_event_id ASC
       ) AS rn
FROM card_authorizations;

Incremental scoping (time/partition/range) to avoid full-table scans

WHERE ingested_at >= TIMESTAMP '2026-01-10 00:00:00'
  AND ingested_at <  TIMESTAMP '2026-01-11 12:00:00'

Zero-downtime rebuild-and-swap (backfill + dual-write)

CREATE TABLE card_authorizations_v2 (LIKE card_authorizations INCLUDING ALL);
-- Add a uniqueness guarantee in v2
CREATE UNIQUE INDEX CONCURRENTLY ux_auth_key
ON card_authorizations_v2 (merchant_id, network_auth_id);

Batched deletes using a stable key to limit lock time

WITH d AS (
  SELECT auth_event_id
  FROM (
    SELECT auth_event_id,
           ROW_NUMBER() OVER (PARTITION BY merchant_id, network_auth_id ORDER BY ingested_at, auth_event_id) rn
    FROM card_authorizations
    WHERE ingested_at >= :start_ts AND ingested_at < :end_ts
  ) x
  WHERE rn > 1
  ORDER BY auth_event_id
  LIMIT 50000
)
DELETE FROM card_authorizations
WHERE auth_event_id IN (SELECT auth_event_id FROM d);

Preventing recurrence with idempotency and constraints

INSERT INTO card_authorizations (merchant_id, network_auth_id, amount_cents, ingested_at)
VALUES (...)
ON CONFLICT (merchant_id, network_auth_id)
DO UPDATE SET amount_cents = EXCLUDED.amount_cents,
              ingested_at = LEAST(card_authorizations.ingested_at, EXCLUDED.ingested_at);

Problem

Context

Core Question

How would you identify and remove duplicate records from a table with billions of rows without downtime?

In your answer, cover the following:

Definition of “duplicate”: what key(s) define uniqueness (e.g., merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
Detection approach: how you would find duplicates efficiently (e.g., window functions like ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
Removal approach without downtime: describe at least one safe strategy such as:
- creating a new deduplicated table and performing a backfill + dual-write cutover, or
- deleting in small batches using a stable ordering key, or
- using partition swaps (where supported).
Constraints & correctness: how you ensure consistent results while new rows arrive (isolation level, watermarking, idempotent writes, reconciliation queries).
Performance & operational safety: indexes required, batching size, avoiding table locks, monitoring (replication lag, dead tuples, vacuum), and rollback plan.

Scope Guidance (What a strong answer includes)

Concrete SQL patterns (e.g., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you’d use them in a staged process.
Trade-offs between “rebuild + swap” vs “in-place batched deletes” and when each is appropriate.
A brief discussion of enforcing future uniqueness (unique constraints, partial indexes, idempotency keys, or upstream exactly-once semantics).

Key Concepts

Defining a deterministic survivor row

SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY merchant_id, network_auth_id
         ORDER BY ingested_at ASC, auth_event_id ASC
       ) AS rn
FROM card_authorizations;

Incremental scoping (time/partition/range) to avoid full-table scans

WHERE ingested_at >= TIMESTAMP '2026-01-10 00:00:00'
  AND ingested_at <  TIMESTAMP '2026-01-11 12:00:00'

Zero-downtime rebuild-and-swap (backfill + dual-write)

CREATE TABLE card_authorizations_v2 (LIKE card_authorizations INCLUDING ALL);
-- Add a uniqueness guarantee in v2
CREATE UNIQUE INDEX CONCURRENTLY ux_auth_key
ON card_authorizations_v2 (merchant_id, network_auth_id);

Batched deletes using a stable key to limit lock time

WITH d AS (
  SELECT auth_event_id
  FROM (
    SELECT auth_event_id,
           ROW_NUMBER() OVER (PARTITION BY merchant_id, network_auth_id ORDER BY ingested_at, auth_event_id) rn
    FROM card_authorizations
    WHERE ingested_at >= :start_ts AND ingested_at < :end_ts
  ) x
  WHERE rn > 1
  ORDER BY auth_event_id
  LIMIT 50000
)
DELETE FROM card_authorizations
WHERE auth_event_id IN (SELECT auth_event_id FROM d);

Preventing recurrence with idempotency and constraints

INSERT INTO card_authorizations (merchant_id, network_auth_id, amount_cents, ingested_at)
VALUES (...)
ON CONFLICT (merchant_id, network_auth_id)
DO UPDATE SET amount_cents = EXCLUDED.amount_cents,
              ingested_at = LEAST(card_authorizations.ingested_at, EXCLUDED.ingested_at);

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Debug and Optimize Fraud DeduplicationMedium RDeduplicate Out-of-Order Replication EventsMedium Offline Sync Queue With DedupMedium

Next question