Context
PayFlux, a digital payments company, processes card authorizations, settlements, refunds, and chargebacks through asynchronous services. Today, upstream APIs publish events to Kafka and multiple consumers update operational stores and analytics tables independently, but the platform has seen duplicate writes, out-of-order updates, and reconciliation gaps during retries and partial failures.
You need to design a data pipeline architecture that preserves data integrity across asynchronous processing while keeping near-real-time availability for finance, risk, and customer support teams.
Scale Requirements
- Throughput: 180K events/sec peak, 45K avg
- Event size: 1-4 KB Avro messages
- Daily volume: ~6.5B events/day, ~12 TB raw
- Latency target: P95 source-to-warehouse < 2 minutes
- Consistency target: No lost committed events; duplicate rate in curated tables < 0.001%
- Retention: Kafka 14 days, S3 raw 1 year, warehouse curated 7 years
Requirements
- Design ingestion and processing for asynchronous payment lifecycle events with idempotent writes and deterministic replay.
- Handle out-of-order, late, and duplicated events across authorization, settlement, refund, and chargeback topics.
- Enforce schema validation, contract versioning, and record-level lineage from producer to warehouse tables.
- Build reconciliation between Kafka offsets, raw S3 data, and Snowflake curated tables to detect missing or duplicated records.
- Support backfills and reprocessing for a single merchant, date range, or event type without corrupting downstream aggregates.
- Define orchestration, monitoring, and failure recovery for partial sink failures and consumer restarts.
- Expose finance-ready tables with exactly-once business semantics even if the transport is at-least-once.
Constraints
- AWS-first environment; use managed services where possible
- Team of 5 data engineers; limited tolerance for custom infrastructure
- PCI scope must be minimized; no raw PAN data outside tokenized upstream systems
- Monthly incremental budget cap: $40K
- Auditability required for SOX: every correction and replay must be traceable