You are redesigning the payment processing pipeline for a digital banking platform after an audit found duplicate captures and inconsistent downstream ledger records during retries, consumer restarts, and backfills. The platform is moving from synchronous request-driven writes to an event-driven architecture so payment authorization, capture, settlement, and reconciliation can be processed independently. The main pain point is that the same business operation may be delivered multiple times by API clients, brokers, and batch replay jobs, but it must produce exactly one financial effect. You need a design that keeps operational systems and analytical stores consistent without blocking throughput.

| Component | Status / Technology |
|---|---|
| Payment API | Java 17 / Spring Boot, client retries with 5s timeout |
| Event Bus | Apache Kafka 3.x, at-least-once delivery |
| Stream Processing | Apache Flink 1.18 for payment state transitions |
| Operational Store | PostgreSQL 14 for payment and idempotency records |
| Data Lake / Warehouse | S3-compatible object storage + Greenplum |
| Orchestration | Apache Airflow 2.x for replay and reconciliation jobs |

Scale: 25K payment requests/sec peak, 3K avg, 1.2B payment events/day, P99 API latency target under 300 ms, settlement and ledger views under 2 minutes fresh, replay windows up to 30 days.

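
To make the scale figures easier to reason about, the following back-of-envelope sketch derives a few rates from the numbers stated above. The variable names are illustrative, and the events-per-request ratio assumes every event originates from a payment request, which the scenario does not state. The replay-window total is only an upper bound on how many event identities a deduplication layer might need to recognize.

```python
# Back-of-envelope sizing from the stated scale figures (names are illustrative).
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 1_200_000_000      # "1.2B payment events/day"
avg_requests_per_sec = 3_000        # "3K avg"
replay_window_days = 30             # "replay windows up to 30 days"

# Average event rate implied by the daily volume.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Rough fan-out per request, ASSUMING all events stem from payment requests.
events_per_request = avg_events_per_sec / avg_requests_per_sec

# Upper bound on events a full-window replay could re-deliver.
events_in_replay_window = events_per_day * replay_window_days

print(f"avg events/sec:          {avg_events_per_sec:,.0f}")
print(f"events per request:      {events_per_request:.1f}")
print(f"events in replay window: {events_in_replay_window:,}")
```

Under these assumptions the average event rate is roughly 13.9K events/sec, and a full 30-day replay could re-deliver up to 36 billion events.
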
How would you design an idempotency framework across the API, streaming pipeline, storage layers, and replay workflows so duplicate requests and duplicate events never create duplicate financial side effects, while still supporting retries, late events, and backfills at this scale?
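
As a starting point for discussion, here is a minimal sketch of the core primitive such a framework usually rests on: recording an idempotency key and the financial effect in the same transaction, so a redelivered request or event becomes a no-op. It is not a full design. The table names, the `capture_once` helper, and the key format are hypothetical, and in-memory SQLite stands in for the PostgreSQL operational store so the example is self-contained (in PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`).

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the operational store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # The primary key is what actually enforces "at most one effect per key".
    conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE ledger_entries ("
        "payment_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    return conn


def capture_once(
    conn: sqlite3.Connection,
    idempotency_key: str,
    payment_id: str,
    amount_cents: int,
) -> bool:
    """Apply a capture at most once per key. Returns True if the effect was applied."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            # SQLite syntax; ignores the insert if the key already exists.
            "INSERT OR IGNORE INTO idempotency_keys (key) VALUES (?)",
            (idempotency_key,),
        )
        if cur.rowcount == 0:
            # Key already recorded: a retry, redelivery, or replay. Do nothing.
            return False
        conn.execute(
            "INSERT INTO ledger_entries (payment_id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        return True


if __name__ == "__main__":
    store = open_store()
    first = capture_once(store, "capture:pay-1", "pay-1", 1000)
    retry = capture_once(store, "capture:pay-1", "pay-1", 1000)
    rows = store.execute("SELECT COUNT(*) FROM ledger_entries").fetchone()[0]
    print(first, retry, rows)
```

Because the key insert and the ledger write commit together, a crash between them cannot leave a recorded key without its effect or an effect without its key. The same pattern can be applied at the API layer (client-supplied key), in stream consumers (event ID as key), and in replay jobs, though key scoping and retention over a 30-day window would need separate design.
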