Design Idempotent Payment Event Pipeline

Scenario

You are redesigning the payment processing pipeline for a digital banking platform after an audit found duplicate captures and inconsistent downstream ledger records during retries, consumer restarts, and backfills. The platform is moving from synchronous request-driven writes to an event-driven architecture so payment authorization, capture, settlement, and reconciliation can be processed independently. The main pain point is that the same business operation may be delivered multiple times by API clients, brokers, and batch replay jobs, but it must produce exactly one financial effect. You need a design that keeps operational systems and analytical stores consistent without blocking throughput.
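The requirement that repeated deliveries produce exactly one financial effect can be illustrated with a toy example. This is a minimal sketch, not a proposed design: the `Ledger` class, the `capture` method, and the key format `pay-1:capture` are hypothetical names invented for illustration, and an in-memory dict stands in for the operational store.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory ledger (hypothetical); stands in for the operational store."""
    applied: dict = field(default_factory=dict)  # idempotency key -> amount in cents
    balance_cents: int = 0

    def capture(self, idempotency_key: str, amount_cents: int) -> bool:
        """Apply a capture once per key; return True only when an effect was applied."""
        if idempotency_key in self.applied:
            # Duplicate delivery (client retry, broker redelivery, replay): no new effect.
            return False
        self.applied[idempotency_key] = amount_cents
        self.balance_cents += amount_cents
        return True


ledger = Ledger()
# The same business operation ("pay-1:capture") arrives three times.
deliveries = [
    ("pay-1:capture", 1500),
    ("pay-1:capture", 1500),
    ("pay-2:capture", 700),
    ("pay-1:capture", 1500),
]
results = [ledger.capture(key, amount) for key, amount in deliveries]
```

Only the first delivery of each key changes the balance. A real system would need the key check and the write to be atomic and durable, which this sketch does not attempt.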

Current State

Component	Status / Technology
Payment API	Java 17 / Spring Boot, client retries with 5s timeout
Event Bus	Apache Kafka 3.x, at-least-once delivery
Stream Processing	Apache Flink 1.18 for payment state transitions
Operational Store	PostgreSQL 14 for payment and idempotency records
Data Lake / Warehouse	S3-compatible object storage + Greenplum
Orchestration	Apache Airflow 2.x for replay and reconciliation jobs

Scale: 25K payment requests/sec at peak, 3K on average; 1.2B payment events/day; P99 API latency target under 300 ms; settlement and ledger views at most 2 minutes stale; replay windows of up to 30 days.
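The scale figures can be turned into per-second and per-window numbers with a little arithmetic. This is a rough sketch under stated assumptions: the daily event count and the average request rate are taken to describe the same period, and the events-per-request ratio assumes every event derives from an API request, which the scenario does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the scale line above.
events_per_day = 1_200_000_000
avg_requests_per_sec = 3_000
peak_requests_per_sec = 25_000
replay_window_days = 30

avg_events_per_sec = events_per_day / SECONDS_PER_DAY           # about 13.9K events/sec
avg_requests_per_day = avg_requests_per_sec * SECONDS_PER_DAY   # 259.2M requests/day
events_per_request = events_per_day / avg_requests_per_day      # about 4.6 (assumption-laden)
peak_to_avg_ratio = peak_requests_per_sec / avg_requests_per_sec  # about 8.3x burst

# Upper bound on events a full-window replay could re-deliver.
replay_window_events = replay_window_days * events_per_day      # 36 billion
```

The 36 billion figure is an upper bound on what a full 30-day replay could re-deliver, so it also bounds how many event identities any deduplication scheme would have to recognize over that window.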

Question

How would you design an idempotency framework across the API, streaming pipeline, storage layers, and replay workflows so duplicate requests and duplicate events never create duplicate financial side effects, while still supporting retries, late events, and backfills at this scale?
