You’re on the data platform team at a global fintech processor that ingests 2–5 billion card events per day into a PostgreSQL-compatible warehouse powering fraud detection, chargeback workflows, and regulatory reporting. A bug in an upstream Kafka consumer caused duplicate writes into the card_authorizations fact table for ~36 hours. The table is append-heavy, queried continuously by risk models and dashboards, and is also used to generate audit artifacts.
Because this table is business-critical, you cannot take downtime, cannot block reads for long, and you must avoid massive long-running transactions that bloat WAL/undo, saturate I/O, or trigger replication lag. You need a plan that both (a) identifies duplicates reliably and (b) removes them safely while the system stays online.
How would you identify and remove duplicate records from a table with billions of rows without downtime?
In your answer, cover the following:
merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you’d use them in a staged process.Deduplication requires a deterministic rule for which row to keep (e.g., earliest ingestion, latest update, highest-quality payload). This prevents flapping results across runs and enables idempotent cleanup jobs.
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY merchant_id, network_auth_id
ORDER BY ingested_at ASC, auth_event_id ASC
) AS rn
FROM card_authorizations;
On billion-row tables, you rarely scan everything. You constrain work to affected partitions (e.g., the 36-hour window), or to ranges on a clustered key, and process incrementally with watermarks to control runtime and I/O.
WHERE ingested_at >= TIMESTAMP '2026-01-10 00:00:00'
AND ingested_at < TIMESTAMP '2026-01-11 12:00:00'
A common no-downtime pattern is to build a new table with the correct uniqueness, backfill deduped data, then dual-write new events to both tables until caught up. Finally, you cut over via a view rename or atomic swap, minimizing lock time.
CREATE TABLE card_authorizations_v2 (LIKE card_authorizations INCLUDING ALL);
-- Add a uniqueness guarantee in v2
CREATE UNIQUE INDEX CONCURRENTLY ux_auth_key
ON card_authorizations_v2 (merchant_id, network_auth_id);
If rebuilding is too expensive, you can delete duplicates in small batches to avoid long transactions and heavy locking. The batch query should use an indexable predicate and delete by primary key to keep each batch fast and predictable.
WITH d AS (
SELECT auth_event_id
FROM (
SELECT auth_event_id,
ROW_NUMBER() OVER (PARTITION BY merchant_id, network_auth_id ORDER BY ingested_at, auth_event_id) rn
FROM card_authorizations
WHERE ingested_at >= :start_ts AND ingested_at < :end_ts
) x
WHERE rn > 1
ORDER BY auth_event_id
LIMIT 50000
)
DELETE FROM card_authorizations
WHERE auth_event_id IN (SELECT auth_event_id FROM d);
Cleanup is incomplete if duplicates can reappear. You typically add an idempotency key (or natural unique key) and enforce it with a unique index/constraint, plus upstream retry semantics that use INSERT…ON CONFLICT (or MERGE) to make writes idempotent.
INSERT INTO card_authorizations (merchant_id, network_auth_id, amount_cents, ingested_at)
VALUES (...)
ON CONFLICT (merchant_id, network_auth_id)
DO UPDATE SET amount_cents = EXCLUDED.amount_cents,
ingested_at = LEAST(card_authorizations.ingested_at, EXCLUDED.ingested_at);