You’re on the data platform team at a fintech payments company processing 50M+ ledger events/day from multiple upstream services (KYC, card processing, bank transfers). Events are ingested into a centralized warehouse for downstream reconciliation and regulatory reporting.
A recent incident revealed that the upstream “replication” service occasionally emits duplicate Repl created events for the same business entity (e.g., the same transfer), and these duplicates can arrive out of order due to retries, network delays, and at-least-once delivery semantics. If duplicates aren’t removed, downstream jobs may double-count “created” states, causing incorrect operational dashboards and noisy alerts.
Write a SQL query that identifies duplicate Repl created events that may arrive out of order and outputs the rows that should be removed (i.e., all but the earliest true creation per entity).
Assume that for a given entity_id, the correct record to keep is the event with the earliest created_at timestamp. If multiple events tie on created_at, keep the one with the smallest event_id.
Requirements:

- Consider only events where `event_type = 'Repl created'`.
- Duplicates are defined per `entity_id` and `event_type`.
- For each `entity_id`, keep exactly 1 Repl created event (the earliest by `created_at`, tie-break by smallest `event_id`).
- Output the columns `event_id`, `entity_id`, `event_type`, `created_at`, `ingested_at`, and `duplicate_rank` (where rank 1 is the kept row, rank > 1 are duplicates).
- Order results by `entity_id`, then `duplicate_rank`, then `created_at`.

Schema: `replication_events`

| Column | Type | Description |
|---|---|---|
| event_id | BIGINT | Unique event row identifier (monotonic, not necessarily time-ordered) |
| entity_id | VARCHAR(64) | Business entity identifier (e.g., transfer_id) |
| event_type | VARCHAR(50) | Event name (e.g., Repl created, Repl updated) |
| created_at | TIMESTAMP | Timestamp when the event occurred in the source system |
| ingested_at | TIMESTAMP | Timestamp when the event was ingested into the warehouse |
| source | VARCHAR(30) | Producing service (e.g., replicator-v2) |
Sample data: `replication_events`

| event_id | entity_id | event_type | created_at | ingested_at | source |
|---|---|---|---|---|---|
| 9001 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:05:00 | replicator-v2 |
| 9002 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:01:00 | replicator-v2 |
| 9003 | tr_1001 | Repl created | 2026-01-10 09:59:58 | 2026-01-10 10:06:00 | replicator-v2 |
| 9010 | tr_2002 | Repl created | 2026-01-11 12:00:00 | 2026-01-11 12:00:03 | replicator-v2 |
| 9011 | tr_2002 | Repl updated | 2026-01-11 12:05:00 | 2026-01-11 12:05:02 | replicator-v2 |
Rows to delete (duplicates only):
| event_id | entity_id | event_type | created_at | ingested_at | duplicate_rank |
|---|---|---|---|---|---|
| 9001 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:05:00 | 2 |
| 9002 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:01:00 | 3 |
Explanation for the sample: even though event 9003 was ingested last (ingested_at is 10:06:00), it has the earliest created_at (09:59:58), so it is the row to keep (duplicate_rank = 1); 9001 and 9002 are the out-of-order duplicates to remove.
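One common way to implement this is a `ROW_NUMBER()` window partitioned by `entity_id` and ordered by `created_at`, then `event_id`. The sketch below (an illustration, not the only valid answer) runs that query against the sample rows in an in-memory SQLite database to show which duplicates it flags:

```python
import sqlite3

# Build the sample replication_events table in-memory (timestamps as TEXT,
# which sorts correctly in 'YYYY-MM-DD HH:MM:SS' format).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE replication_events (
        event_id    INTEGER,
        entity_id   TEXT,
        event_type  TEXT,
        created_at  TEXT,
        ingested_at TEXT,
        source      TEXT
    )
""")
conn.executemany(
    "INSERT INTO replication_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (9001, "tr_1001", "Repl created", "2026-01-10 10:00:00", "2026-01-10 10:05:00", "replicator-v2"),
        (9002, "tr_1001", "Repl created", "2026-01-10 10:00:00", "2026-01-10 10:01:00", "replicator-v2"),
        (9003, "tr_1001", "Repl created", "2026-01-10 09:59:58", "2026-01-10 10:06:00", "replicator-v2"),
        (9010, "tr_2002", "Repl created", "2026-01-11 12:00:00", "2026-01-11 12:00:03", "replicator-v2"),
        (9011, "tr_2002", "Repl updated", "2026-01-11 12:05:00", "2026-01-11 12:05:02", "replicator-v2"),
    ],
)

QUERY = """
WITH ranked AS (
    SELECT
        event_id, entity_id, event_type, created_at, ingested_at,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id, event_type
            ORDER BY created_at ASC, event_id ASC   -- earliest creation wins; ties broken by smallest event_id
        ) AS duplicate_rank
    FROM replication_events
    WHERE event_type = 'Repl created'
)
SELECT event_id, entity_id, event_type, created_at, ingested_at, duplicate_rank
FROM ranked
WHERE duplicate_rank > 1                            -- rank 1 is the kept row; everything else is a duplicate
ORDER BY entity_id, duplicate_rank, created_at
"""

duplicates = conn.execute(QUERY).fetchall()
for row in duplicates:
    print(row)
```

Filtering on `WHERE event_type = 'Repl created'` before ranking keeps unrelated events (like 9011, a `Repl updated`) out of the window entirely, so entity tr_2002's single created event is never flagged. Against the sample data, this returns 9001 (rank 2) and 9002 (rank 3), matching the expected output above.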