You’re on the data platform team at a global fintech processor that ingests 2–5 billion card events per day into a PostgreSQL-compatible warehouse powering fraud detection, chargeback workflows, and regulatory reporting. A bug in an upstream Kafka consumer caused duplicate writes into the card_authorizations fact table for ~36 hours. The table is append-heavy, queried continuously by risk models and dashboards, and is also used to generate audit artifacts.
Because this table is business-critical, you cannot take downtime, cannot block reads for long, and you must avoid massive long-running transactions that bloat WAL/undo, saturate I/O, or trigger replication lag. You need a plan that both (a) identifies duplicates reliably and (b) removes them safely while the system stays online.
How would you identify and remove duplicate records from a table with billions of rows without downtime?
In your answer, cover the following:
- How you define a duplicate (e.g., a natural key such as merchant_id + network_auth_id, or a computed idempotency key), and how you handle near-duplicates (same key but different payload).
- Which detection techniques you would use (ROW_NUMBER(), grouping, hashing) and how you would scope the scan (time partitions, affected shards, incremental ranges).
- Which window functions apply (ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) and how you'd use them in a staged process; a sketch of such a process follows this list.
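As a rough illustration of the staged approach, here is a minimal PostgreSQL sketch. The schema is assumed, not taken from the scenario: an `id` surrogate key, `merchant_id`, `network_auth_id`, `auth_payload`, and `event_time` columns on card_authorizations, a hypothetical `card_auth_dup_candidates` staging table, and placeholder incident-window timestamps.

```sql
-- Stage 1: scoped detection. Scan only the ~36-hour incident window and record
-- every row that is NOT the first occurrence of its natural key in a staging table.
-- All table and column names below are illustrative assumptions.
CREATE TABLE IF NOT EXISTS card_auth_dup_candidates (
    id              bigint PRIMARY KEY,   -- surrogate key of the duplicate row
    keep_id         bigint NOT NULL,      -- surrogate key of the row being kept (audit trail)
    payload_matches boolean NOT NULL,     -- false => near-duplicate, route to manual review
    detected_at     timestamptz NOT NULL DEFAULT now()
);

INSERT INTO card_auth_dup_candidates (id, keep_id, payload_matches)
SELECT d.id, d.keep_id, d.payload_hash = d.keep_payload_hash
FROM (
    SELECT id,
           md5(auth_payload::text)                     AS payload_hash,
           FIRST_VALUE(id) OVER w                      AS keep_id,
           FIRST_VALUE(md5(auth_payload::text)) OVER w AS keep_payload_hash,
           ROW_NUMBER() OVER w                         AS rn
    FROM card_authorizations
    WHERE event_time >= '2024-06-01 00:00+00'   -- start of incident window (placeholder)
      AND event_time <  '2024-06-02 12:00+00'   -- end of incident window (placeholder)
    WINDOW w AS (PARTITION BY merchant_id, network_auth_id ORDER BY event_time, id)
) d
WHERE d.rn > 1
ON CONFLICT (id) DO NOTHING;

-- Stage 2: batched deletes. Each statement removes a small slice of confirmed
-- exact duplicates, so no single transaction holds locks or generates WAL for long.
-- Run it repeatedly (with pauses) until it deletes zero rows; near-duplicates
-- stay in the staging table for manual review.
WITH batch AS (
    DELETE FROM card_auth_dup_candidates c
    WHERE c.id IN (
        SELECT id
        FROM card_auth_dup_candidates
        WHERE payload_matches
        ORDER BY id
        LIMIT 5000
        FOR UPDATE SKIP LOCKED
    )
    RETURNING c.id
)
DELETE FROM card_authorizations ca
USING batch
WHERE ca.id = batch.id;
```

In practice you would drive Stage 2 from a script that sleeps between batches and watches replication lag (e.g., pg_stat_replication on vanilla PostgreSQL), archive the staging rows rather than delete them if they feed audit artifacts, VACUUM/ANALYZE afterwards, and close the loop by making the consumer idempotent (for example, via a unique constraint or upsert on the natural key) so the duplicates cannot recur.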