Context
PayLink, a B2B payments platform, ingests transaction, ledger, and webhook events from PostgreSQL, partner APIs, and regional payment gateways. The current pipeline is a mix of hourly batch ETL and ad hoc consumers, which causes duplicate financial records and inconsistent downstream balances during network partitions.
You are asked to redesign the pipeline and explain how CAP theorem trade-offs apply to your design. The platform must continue processing payment events during partial outages, while preserving correctness for finance reporting and reconciliation.
Scale Requirements
- Throughput: 120K events/sec peak, 25K events/sec average
- Event size: 1-4 KB JSON/Avro
- Daily volume: ~2.5B events, ~6 TB raw
- Latency: fraud and ops dashboards < 10 seconds; finance warehouse < 5 minutes
- Retention: raw immutable events for 180 days; curated financial tables for 7 years
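As a quick sanity check on the figures above, the following back-of-envelope sketch derives the daily event count from the average rate, the average event size implied by the daily raw volume, and the peak ingest bandwidth. It assumes decimal units (1 TB = 10^9 KB, 1 MB = 1000 KB), and the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_events(avg_events_per_sec: int) -> int:
    """Events per day if the average rate held for a full day."""
    return avg_events_per_sec * SECONDS_PER_DAY


def implied_avg_event_kb(raw_tb_per_day: float, events_per_day: float) -> float:
    """Average event size in KB implied by daily raw volume (1 TB = 1e9 KB)."""
    return raw_tb_per_day * 1e9 / events_per_day


# 25K events/sec sustained for a day: about 2.16B events.
events_at_avg_rate = daily_events(25_000)

# ~6 TB spread over ~2.5B events: about 2.4 KB per event.
avg_kb = implied_avg_event_kb(6, 2.5e9)

# Peak bandwidth in MB/s for the 1 KB and 4 KB ends of the size range.
peak_mb_per_sec_range = (120_000 * 1 / 1000, 120_000 * 4 / 1000)
```

Under these assumptions the stated numbers are mutually consistent to within rounding: the average rate yields roughly 2.16B events/day against the stated ~2.5B, the implied event size falls inside the 1-4 KB range, and peak ingest is on the order of 120 to 480 MB/s before compression and replication.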
Requirements
- Design an ingestion and processing architecture for payment events, CDC updates, and third-party webhooks.
- Explain where the system chooses consistency over availability and where it chooses availability over consistency under partition scenarios.
- Support idempotent processing, replay, and backfills without double-counting transactions.
- Produce two outputs:
  - low-latency operational views for fraud/risk
  - strongly governed finance tables for reconciliation and reporting
- Define data quality checks for schema drift, duplicate events, missing ledger entries, and out-of-order delivery.
- Describe orchestration, monitoring, and incident response for lag, failed loads, and partition-related inconsistencies.
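The idempotency requirement above can be illustrated with a minimal sketch. It assumes every event carries a stable, producer-assigned `event_id` (a hypothetical field name), and it uses an in-memory set as a stand-in for whatever durable deduplication store a real design would choose. It is one possible approach, not a prescribed solution.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PaymentEvent:
    event_id: str       # hypothetical stable idempotency key
    account: str
    amount_cents: int


@dataclass
class IdempotentLedger:
    # In-memory stand-ins; a real pipeline would need durable state.
    seen: set = field(default_factory=set)
    balances: dict = field(default_factory=dict)

    def apply(self, event: PaymentEvent) -> bool:
        """Apply an event once; return False if it was already applied."""
        if event.event_id in self.seen:
            return False
        self.seen.add(event.event_id)
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = current + event.amount_cents
        return True


batch = [
    PaymentEvent("evt-1", "acct-A", 1000),
    PaymentEvent("evt-2", "acct-A", 250),
    PaymentEvent("evt-3", "acct-B", 400),
]

ledger = IdempotentLedger()
first_pass = [ledger.apply(e) for e in batch]
replay_pass = [ledger.apply(e) for e in batch]  # simulated replay or backfill
```

Replaying the same batch leaves the balances unchanged because each `event_id` is applied at most once, which is the property that lets replays and backfills run without double-counting.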
Constraints
- AWS-first environment; existing stack includes Kafka, Airflow, S3, and Snowflake
- PCI scope applies to payment attributes; PII must be tokenized before landing in analytics stores
- Incremental budget cap: $35K/month
- Team of 5 data engineers; solution should avoid excessive operational complexity
- Regional gateway outages and cross-AZ network partitions are expected failure modes
Your answer should explicitly connect design choices to the CAP theorem: for example, explain whether fraud dashboards can tolerate eventual consistency while ledger reconciliation requires linearizable or transactionally consistent writes.
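One way to make such a trade-off explicit is to write it down as a per-workload policy. The sketch below is a hypothetical illustration, not an expected answer: the workload names, the `quorum_reachable` flag, and the action strings are all assumptions introduced for this example.

```python
from enum import Enum


class Workload(Enum):
    FRAUD_DASHBOARD = "fraud_dashboard"
    LEDGER_RECONCILIATION = "ledger_reconciliation"


def action_under_partition(workload: Workload, quorum_reachable: bool) -> str:
    """Return the behaviour a workload takes given partition state."""
    if quorum_reachable:
        # No partition affecting this write or read path: proceed normally.
        return "proceed"
    if workload is Workload.FRAUD_DASHBOARD:
        # Availability over consistency: keep serving, possibly stale data.
        return "serve_stale"
    # Consistency over availability: hold ledger writes until quorum returns.
    return "defer_until_quorum"


decisions = {
    w.value: action_under_partition(w, quorum_reachable=False) for w in Workload
}
```

In this sketch the fraud dashboard stays available with eventually consistent data during a partition, while ledger reconciliation gives up availability rather than risk inconsistent financial records.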