Context
PayWave, a digital payments platform processing card-not-present transactions, currently runs hourly batch fraud scoring with Apache Spark on S3-backed transaction logs. Fraud analysts want sub-second blocking for high-risk payments, but the finance team still needs complete, reconciled datasets for investigations, chargebacks, and model retraining.
You need to design a fraud data pipeline and explain the trade-offs between batch and stream processing, including whether a single approach or a hybrid architecture best fits these requirements.
Scale Requirements
- Transaction volume: 120K transactions/second peak, 25K average
- Event size: ~1.5 KB JSON per authorization event
- Daily data volume: ~8 TB of raw transaction, device, and merchant events
- Decision latency: P95 < 300 ms for online fraud decisions
- Batch SLA: Reconciled fraud fact tables available within 30 minutes of hour close
- Retention: 13 months hot storage, 7 years archived for compliance
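A quick sanity check of the figures above, assuming ~1.5 KB per event and that the ~8 TB/day total includes device and merchant events beyond raw authorizations:

```python
# Back-of-envelope sizing derived from the stated scale requirements.
EVENT_BYTES = 1.5 * 1024          # ~1.5 KB JSON per authorization event
PEAK_TPS = 120_000                # peak transactions/second
AVG_TPS = 25_000                  # average transactions/second
SECONDS_PER_DAY = 86_400

# Peak ingress the streaming layer must absorb, in MB/s.
peak_mb_per_s = PEAK_TPS * EVENT_BYTES / 1024**2            # ~176 MB/s

# Daily volume from authorization events alone, in TB.
auth_tb_per_day = AVG_TPS * EVENT_BYTES * SECONDS_PER_DAY / 1024**4  # ~3 TB

print(f"peak ingress: {peak_mb_per_s:.0f} MB/s")
print(f"auth events alone: {auth_tb_per_day:.1f} TB/day")
```

Authorization events account for roughly 3 TB/day; the remainder of the ~8 TB comes from device fingerprint and merchant risk events.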
Requirements
- Design ingestion for real-time transaction, device fingerprint, and merchant risk events.
- Support online fraud scoring for payment authorization decisions with low latency.
- Build batch reconciliation to correct late, duplicated, or out-of-order events and produce investigation-ready tables.
- Define how features such as card velocity, merchant anomaly counts, and device reuse are computed in streaming vs batch.
- Ensure idempotent processing, replay capability, and auditable lineage from raw event to fraud decision.
- Describe orchestration, monitoring, and failure recovery for both real-time and batch paths.
- Explain the trade-offs in accuracy, latency, cost, operational complexity, and recovery when choosing batch, streaming, or hybrid.
Constraints
- Existing stack is AWS-first: MSK, S3, EMR, Airflow, Snowflake
- Incremental budget is capped at $40K/month
- PCI and SOC 2 controls apply; PII must be encrypted in transit and at rest
- Team has strong Spark/Airflow experience but limited expertise operating low-latency stateful streaming at scale