Context
LedgerFlow, a B2B payments platform, exposes a core API for payment creation, refunds, and ledger mutations. Today, retries from clients, load balancers, and downstream workers can produce duplicate side effects, and the current batch reconciliation process detects issues hours later. You need to design an idempotency framework as a data pipeline problem: capture requests, deduplicate safely across synchronous API and asynchronous processing, and provide replay, auditability, and monitoring.
Scale Requirements
- Traffic: 35K API requests/second peak, 8K average
- Payload size: 1-8 KB JSON per request
- Idempotency window: 72 hours for external clients; audit retention: 30 days in hot storage, 1 year in cold storage
- Latency target: P99 API overhead from idempotency checks < 25 ms
- Correctness: no duplicate side effects for committed operations under retries, worker restarts, or network timeouts
- Storage: ~2.5B idempotency records/month
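A quick back-of-envelope check of the numbers above. The ~5 KB average record size is an assumption (midpoint of the 1-8 KB payload range plus a small allowance for key, state, and audit metadata):

```python
# Back-of-envelope sizing for the idempotency store.
# ASSUMPTION: avg stored record ~= 4.5 KB payload midpoint + ~0.5 KB
# of key/state/audit metadata => ~5 KB per record.
RECORDS_PER_MONTH = 2.5e9
AVG_RECORD_KB = 5

hot_tb_per_month = RECORDS_PER_MONTH * AVG_RECORD_KB / 1e9  # KB -> TB (decimal)
print(f"hot storage growth: ~{hot_tb_per_month:.1f} TB/month")

# Records alive inside the 72-hour dedup window at the 8K rps average:
window_records = 8_000 * 72 * 3600
print(f"records inside the 72h window: ~{window_records / 1e9:.1f}B")
```

At these assumptions the hot tier grows roughly 12.5 TB/month before compression or TTL expiry, which is worth keeping in mind against the $40K/month budget cap in the Constraints section.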
Requirements
- Design a framework that guarantees that the same idempotency key + request fingerprint returns the original response without re-executing side effects.
- Support both online request handling and async pipeline stages (Kafka consumers, Airflow backfills, replay jobs).
- Define the storage model for idempotency keys, request hashes, response snapshots, processing state, TTL, and audit metadata.
- Handle race conditions from concurrent duplicate requests across multiple API pods and regions.
- Prevent false reuse: reject requests that send the same key with a different payload (fingerprint mismatch).
- Support reprocessing/backfills while preserving idempotent writes into downstream warehouse tables.
- Include monitoring, alerting, dead-letter handling, and operational recovery procedures.
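One way to satisfy the "same key + fingerprint, same response" and race-condition requirements above is an atomic claim-then-execute pattern: canonicalize and hash the payload, then use a unique-key insert to decide, in a single step, whether the request is new, a retry, or a key reuse with a different body. A minimal sketch, using SQLite as a stand-in for the PostgreSQL store (table and column names are illustrative, not from the spec):

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE idempotency (
        key           TEXT PRIMARY KEY,  -- client-supplied idempotency key
        request_hash  TEXT NOT NULL,     -- fingerprint of canonical payload
        state         TEXT NOT NULL,     -- IN_PROGRESS | COMPLETED
        response_body TEXT               -- snapshot replayed on retries
    )""")

def fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so logically equal
    # payloads hash identically regardless of field order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def handle(key: str, payload: dict, execute):
    fp = fingerprint(payload)
    # Atomic claim: among concurrent duplicates, exactly one insert wins.
    cur = db.execute(
        "INSERT INTO idempotency (key, request_hash, state) "
        "VALUES (?, ?, 'IN_PROGRESS') ON CONFLICT(key) DO NOTHING", (key, fp))
    if cur.rowcount == 1:                 # we own the key: run the side effect
        result = execute(payload)
        db.execute("UPDATE idempotency SET state='COMPLETED', response_body=? "
                   "WHERE key=?", (json.dumps(result), key))
        return 201, result
    row = db.execute("SELECT request_hash, state, response_body "
                     "FROM idempotency WHERE key=?", (key,)).fetchone()
    if row[0] != fp:                      # same key, different payload
        return 409, {"error": "idempotency key reused with different body"}
    if row[1] != "COMPLETED":             # concurrent duplicate still in flight
        return 409, {"error": "request in progress, retry later"}
    return 200, json.loads(row[2])        # replay the stored response
```

A production version would additionally need a lease or timeout on IN_PROGRESS rows so a crashed pod does not wedge a key forever, plus TTL cleanup for the 72-hour window; those are deliberately omitted here.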
Constraints
- AWS-first stack; existing services use EKS, PostgreSQL, Kafka, Airflow, and S3
- Budget increase capped at $40K/month
- PCI and SOX audit requirements; immutable audit trail required
- Cross-region active/passive failover; RPO < 5 minutes, RTO < 30 minutes
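For the reprocessing/backfill requirement, downstream warehouse writes stay idempotent if every load is an upsert keyed on a natural business key rather than an append, so replaying a batch (Airflow backfill, Kafka consumer restart) converges to the same final state. A sketch under that assumption, again with SQLite standing in for the warehouse and illustrative table/column names:

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("""
    CREATE TABLE ledger_entries (
        entry_id     TEXT PRIMARY KEY,  -- natural key, e.g. payment id + event type
        amount_cents INTEGER NOT NULL,
        loaded_at    TEXT NOT NULL
    )""")

def load_batch(rows):
    # Upsert keyed on entry_id: re-running the same batch overwrites rows
    # in place instead of duplicating them.
    wh.executemany(
        "INSERT INTO ledger_entries (entry_id, amount_cents, loaded_at) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(entry_id) DO UPDATE SET "
        "  amount_cents = excluded.amount_cents, loaded_at = excluded.loaded_at",
        rows)
    wh.commit()

batch = [("pay_1:created", 5000, "2024-01-01"),
         ("pay_2:created", 700, "2024-01-01")]
load_batch(batch)
load_batch(batch)  # replay: table converges, no duplicate rows
```

The same shape works in PostgreSQL (`INSERT ... ON CONFLICT ... DO UPDATE`); for append-only audit tables under the immutability constraint, the equivalent trick is to make the natural key part of the row identity and ignore conflicts instead of updating.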