You’re on the data platform team at a fintech payments company processing 50M+ ledger events/day from multiple upstream services (KYC, card processing, bank transfers). Events are ingested into a centralized warehouse for downstream reconciliation and regulatory reporting.
A recent incident revealed that the upstream “replication” service occasionally emits duplicate Repl created events for the same business entity (e.g., the same transfer), and these duplicates can arrive out of order due to retries, network delays, and at-least-once delivery semantics. If duplicates aren’t removed, downstream jobs may double-count “created” states, causing incorrect operational dashboards and noisy alerts.
Write a SQL query that identifies duplicate Repl created events that may arrive out of order and outputs the rows that should be removed (i.e., all but the earliest true creation per entity).
Assume that for a given entity_id, the correct record to keep is the event with the earliest created_at timestamp. If multiple events tie on created_at, keep the one with the smallest event_id.
Requirements:

- Consider only events where `event_type = 'Repl created'`.
- Duplicates are defined per `entity_id` and `event_type`.
- For each `entity_id`, keep exactly 1 Repl created event (the earliest by `created_at`, tie-break by smallest `event_id`).
- Output the columns `event_id`, `entity_id`, `event_type`, `created_at`, `ingested_at`, and `duplicate_rank` (where rank 1 is the kept row, rank > 1 are duplicates).
- Order results by `entity_id`, then `duplicate_rank`, then `created_at`.

Schema: `replication_events`

| Column | Type | Description |
|---|---|---|
| event_id | BIGINT | Unique event row identifier (monotonic, not necessarily time-ordered) |
| entity_id | VARCHAR(64) | Business entity identifier (e.g., transfer_id) |
| event_type | VARCHAR(50) | Event name (e.g., Repl created, Repl updated) |
| created_at | TIMESTAMP | Timestamp when the event occurred in the source system |
| ingested_at | TIMESTAMP | Timestamp when the event was ingested into the warehouse |
| source | VARCHAR(30) | Producing service (e.g., replicator-v2) |
Sample data: `replication_events`

| event_id | entity_id | event_type | created_at | ingested_at | source |
|---|---|---|---|---|---|
| 9001 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:05:00 | replicator-v2 |
| 9002 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:01:00 | replicator-v2 |
| 9003 | tr_1001 | Repl created | 2026-01-10 09:59:58 | 2026-01-10 10:06:00 | replicator-v2 |
| 9010 | tr_2002 | Repl created | 2026-01-11 12:00:00 | 2026-01-11 12:00:03 | replicator-v2 |
| 9011 | tr_2002 | Repl updated | 2026-01-11 12:05:00 | 2026-01-11 12:05:02 | replicator-v2 |
Rows to delete (duplicates only):
| event_id | entity_id | event_type | created_at | ingested_at | duplicate_rank |
|---|---|---|---|---|---|
| 9001 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:05:00 | 2 |
| 9002 | tr_1001 | Repl created | 2026-01-10 10:00:00 | 2026-01-10 10:01:00 | 3 |
Explanation for the sample: even though event 9003 was ingested last (ingested_at is 10:06:00), it has the earliest created_at (09:59:58), so it is the row to keep (duplicate_rank = 1); 9001 and 9002 are the out-of-order duplicates to remove.
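One common way to implement this is a `ROW_NUMBER()` window partitioned by `entity_id` and ordered by `created_at`, then `event_id`. The sketch below (an illustration, not the only valid answer) runs that query against the sample rows in an in-memory SQLite database to show which duplicates it flags:

```python
import sqlite3

# Build the sample replication_events table in-memory (timestamps as TEXT,
# which sorts correctly in 'YYYY-MM-DD HH:MM:SS' format).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE replication_events (
        event_id    INTEGER,
        entity_id   TEXT,
        event_type  TEXT,
        created_at  TEXT,
        ingested_at TEXT,
        source      TEXT
    )
""")
conn.executemany(
    "INSERT INTO replication_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (9001, "tr_1001", "Repl created", "2026-01-10 10:00:00", "2026-01-10 10:05:00", "replicator-v2"),
        (9002, "tr_1001", "Repl created", "2026-01-10 10:00:00", "2026-01-10 10:01:00", "replicator-v2"),
        (9003, "tr_1001", "Repl created", "2026-01-10 09:59:58", "2026-01-10 10:06:00", "replicator-v2"),
        (9010, "tr_2002", "Repl created", "2026-01-11 12:00:00", "2026-01-11 12:00:03", "replicator-v2"),
        (9011, "tr_2002", "Repl updated", "2026-01-11 12:05:00", "2026-01-11 12:05:02", "replicator-v2"),
    ],
)

QUERY = """
WITH ranked AS (
    SELECT
        event_id, entity_id, event_type, created_at, ingested_at,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id, event_type
            ORDER BY created_at ASC, event_id ASC   -- earliest creation wins; ties broken by smallest event_id
        ) AS duplicate_rank
    FROM replication_events
    WHERE event_type = 'Repl created'
)
SELECT event_id, entity_id, event_type, created_at, ingested_at, duplicate_rank
FROM ranked
WHERE duplicate_rank > 1                            -- rank 1 is the kept row; everything else is a duplicate
ORDER BY entity_id, duplicate_rank, created_at
"""

duplicates = conn.execute(QUERY).fetchall()
for row in duplicates:
    print(row)
```

Filtering on `WHERE event_type = 'Repl created'` before ranking keeps unrelated events (like 9011, a `Repl updated`) out of the window entirely, so entity tr_2002's single created event is never flagged. Against the sample data, this returns 9001 (rank 2) and 9002 (rank 3), matching the expected output above.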