Product Context
Voya Financial wants to score incoming financial transactions for fraud across retirement disbursements, account transfers, and linked payment activity on its digital servicing surfaces. The system serves members, contact-center agents, and risk operations teams, and must decide in real time whether to approve a transaction, step up authentication, hold it for review, or decline it.
Scale
| Signal | Value |
|---|---|
| Registered members | 9M |
| Monthly active digital users | 3.5M |
| Transactions/day | 18M |
| Peak transaction QPS | 2,500 |
| Peak feature lookups QPS | 25,000+ |
| Historical labeled transactions | 4.2B over 3 years |
| Fraud rate | ~0.18% of transactions |
| End-to-end decision latency budget (p99) | 120ms |
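A quick back-of-envelope check helps turn the table into sizing numbers. The sketch below uses only the figures above; the variable names are illustrative, and applying the ~0.18% rate to the historical corpus assumes the rate was roughly constant over the three years, which the table does not state.

```python
# Back-of-envelope arithmetic from the Scale table (illustrative names).
SECONDS_PER_DAY = 86_400

transactions_per_day = 18_000_000
peak_tx_qps = 2_500
peak_lookup_qps = 25_000          # table says "25,000+", so treat as a floor
fraud_rate = 0.0018               # "~0.18%", approximate
labeled_history = 4_200_000_000   # labeled transactions over 3 years

avg_tx_qps = transactions_per_day / SECONDS_PER_DAY
peak_to_avg = peak_tx_qps / avg_tx_qps
lookups_per_tx = peak_lookup_qps / peak_tx_qps
fraud_tx_per_day = transactions_per_day * fraud_rate
# Assumption: the current fraud rate also held across the labeled history.
fraud_tx_in_history = labeled_history * fraud_rate
legit_per_fraud = (1 - fraud_rate) / fraud_rate

print(f"average transaction QPS:   {avg_tx_qps:,.0f}")
print(f"peak / average:            {peak_to_avg:.1f}x")
print(f"feature lookups per tx:    {lookups_per_tx:.0f}+")
print(f"fraudulent tx per day:     {fraud_tx_per_day:,.0f}")
print(f"fraudulent tx in history:  {fraud_tx_in_history:,.0f}")
print(f"legitimate tx per fraud:   {legit_per_fraud:,.0f}")
```

Under these inputs, peak load is about 12x the daily average (roughly 208 QPS), each transaction fans out to at least 10 feature lookups at peak, and about 32,400 transactions per day are fraudulent, against roughly 555 legitimate transactions for each one.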
Fraud labels are delayed and noisy: some chargebacks and confirmed fraud cases arrive days after the transaction, and many transactions never receive an explicit negative label, so the absence of a fraud report is not proof that a transaction was legitimate. The business cares about reducing fraud loss without causing excessive false positives that block legitimate retirement and benefits activity.
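One common way to handle delayed labels is a maturity window: a transaction is treated as a negative only after enough time has passed for fraud reports to arrive. The sketch below is a minimal illustration, not a prescribed design. The 45-day window and the function and parameter names are assumptions; the source says only that labels can arrive "days later".

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: 45 days is a placeholder; the real window should come from
# the observed distribution of fraud-report delays.
MATURITY_WINDOW = timedelta(days=45)


def training_label(
    tx_time: datetime,
    fraud_confirmed_at: Optional[datetime],
    as_of: datetime,
    maturity: timedelta = MATURITY_WINDOW,
) -> Optional[int]:
    """Return 1 (fraud), 0 (matured negative), or None (not yet usable).

    Only information available at `as_of` is used, so the same snapshot
    date always reproduces the same label.
    """
    if fraud_confirmed_at is not None and fraud_confirmed_at <= as_of:
        return 1
    if as_of - tx_time >= maturity:
        return 0  # still a noisy negative: fraud may simply be unreported
    return None   # too recent to call; exclude from supervised training


as_of = datetime(2024, 6, 1)
recent = training_label(datetime(2024, 5, 20), None, as_of)
matured = training_label(datetime(2024, 3, 1), None, as_of)
confirmed = training_label(datetime(2024, 5, 20), datetime(2024, 5, 25), as_of)
late_report = training_label(datetime(2024, 5, 20), datetime(2024, 6, 10), as_of)
```

Here `recent` and `late_report` are `None`, because the snapshot at `as_of` cannot yet see a verdict, while `matured` is `0` and `confirmed` is `1`.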
Task
Design an end-to-end ML system for real-time fraud detection at Voya Financial. Address the following:
- Clarify the product requirements, decision actions, and key business tradeoffs between fraud capture and customer friction.
- Propose a multi-stage architecture for real-time scoring, including fast candidate/rule gating, ML ranking/scoring, and a final policy or re-ranking layer for actioning.
- Design the offline and online data architecture: feature computation, feature store, training cadence, label generation, and feedback loop.
- Choose models for each stage and justify them under class imbalance, delayed labels, and strict latency constraints.
- Define offline evaluation, online rollout, monitoring, and alerting, including calibration, drift, and training-serving skew.
- Identify major failure modes and how the system should fail safely under outages or degraded dependencies.
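To make the multi-stage shape concrete, here is a minimal sketch of a rule gate, a scoring call, and a policy layer that maps a score to one of the four actions named in the product context. Everything specific in it is hypothetical: the `destination_on_blocklist` field, the threshold values, and the function names are illustrative assumptions, and the scorer is passed in as a plain callable rather than a real model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Mapping, Optional


class Action(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up_authentication"
    HOLD = "hold_for_review"
    DECLINE = "decline"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut points on a calibrated fraud probability.
    step_up: float = 0.30
    hold: float = 0.70
    decline: float = 0.95


def rule_gate(tx: Mapping[str, object]) -> Optional[Action]:
    """Stage 1: cheap deterministic rules that can short-circuit scoring."""
    if tx.get("destination_on_blocklist"):  # hypothetical field
        return Action.DECLINE
    return None


def policy(score: float, t: Thresholds = Thresholds()) -> Action:
    """Stage 3: map a score to an action, most severe threshold first."""
    if score >= t.decline:
        return Action.DECLINE
    if score >= t.hold:
        return Action.HOLD
    if score >= t.step_up:
        return Action.STEP_UP
    return Action.APPROVE


def decide(
    tx: Mapping[str, object],
    score_fn: Callable[[Mapping[str, object]], float],
) -> Action:
    gated = rule_gate(tx)
    if gated is not None:
        return gated
    return policy(score_fn(tx))  # Stage 2 is the injected scorer


blocked = decide({"destination_on_blocklist": True}, lambda tx: 0.0)
medium = decide({"amount": 500.0}, lambda tx: 0.50)
low = decide({"amount": 500.0}, lambda tx: 0.05)
```

Keeping the policy layer separate from the model means thresholds can be tuned against friction and loss targets without retraining, and the rule gate gives a path that does not depend on the model service at all.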
Constraints
- Must support hard compliance and audit requirements: every decision must be explainable and reproducible.
- PII usage is restricted; sensitive features require governed access and lineage.
- Some features must be updated in near real time (device velocity, recent transfer count, IP risk), while others can be batch refreshed daily.
- False positives are expensive: blocking a legitimate retirement withdrawal or rollover creates customer and regulatory risk.
- The system must remain available during feature store or model service degradation, with deterministic fallback behavior.
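The audit and fallback constraints can be illustrated together: every decision emits a record that pins the model version, a digest of the exact feature values, and the path taken, and a scorer failure routes to a fixed rule rather than an error. This is a sketch under stated assumptions. The dollar limit, the version strings, the single 0.30 threshold, and the choice to step up or hold (rather than approve or decline) during degradation are all hypothetical, not requirements from the source.

```python
import hashlib
import json
from typing import Callable, Dict, Mapping

FALLBACK_HOLD_AMOUNT = 10_000.0  # hypothetical limit, not from the source
POLICY_VERSION = "policy-v0"     # hypothetical identifier


def features_digest(features: Mapping[str, float]) -> str:
    """Hash a canonical JSON encoding so key order does not change the digest."""
    canonical = json.dumps(dict(features), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fallback_action(amount: float) -> str:
    """Deterministic rule used when the model or feature store is unavailable."""
    return "hold_for_review" if amount >= FALLBACK_HOLD_AMOUNT else "step_up"


def decide_with_fallback(
    tx_id: str,
    amount: float,
    features: Mapping[str, float],
    score_fn: Callable[[Mapping[str, float]], float],
    model_version: str,
) -> Dict[str, object]:
    try:
        score = score_fn(features)
        # Placeholder policy: one hypothetical threshold, for brevity.
        action = "approve" if score < 0.30 else "hold_for_review"
        path = "model"
    except (TimeoutError, ConnectionError) as exc:
        score = None
        action = fallback_action(amount)
        path = f"fallback:{type(exc).__name__}"
    return {
        "tx_id": tx_id,
        "action": action,
        "score": score,
        "path": path,
        "model_version": model_version,
        "policy_version": POLICY_VERSION,
        "features_sha256": features_digest(features),
    }


def unavailable(_features: Mapping[str, float]) -> float:
    raise TimeoutError("model service did not respond in time")


feats = {"recent_transfer_count": 3.0, "ip_risk": 0.2}
healthy = decide_with_fallback("tx-1", 250.0, feats, lambda f: 0.1, "model-v0")
degraded = decide_with_fallback("tx-2", 25_000.0, feats, unavailable, "model-v0")
```

Because the fallback depends only on the transaction amount and a fixed constant, replaying the same request during an outage yields the same action, and the `path` field lets auditors tell model decisions from fallback decisions after the fact.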