Product Context
PayShield is a global card issuer and payment processor. Every card authorization request must be scored for fraud in real time before the issuer decides to approve it, decline it, or step up to additional verification.
Scale
| Signal | Value |
|---|---|
| Active cardholders | 45M |
| Merchants | 8M |
| Peak authorization QPS | 120K txns/sec |
| Average daily transactions | 3.2B |
| Historical labeled transactions | 18 months, ~1.4T rows |
| End-to-end decision latency budget | 150ms p99 |
| ML scoring budget within decision flow | 35ms p99 |
| Chargeback / fraud label delay | 7-45 days |
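
As a quick sanity check on the figures above, the following back-of-envelope sketch derives the average transaction rate, the peak-to-average ratio, and a rough upper bound on concurrent scoring requests. It uses only values from the table. Applying Little's law with the p99 scoring budget standing in for mean latency is an assumption made here for illustration, and it deliberately overestimates concurrency.

```python
# Back-of-envelope figures derived only from the Scale table.
SECONDS_PER_DAY = 24 * 60 * 60

daily_txns = 3.2e9          # average daily transactions
peak_qps = 120_000          # peak authorization rate, txns/sec
decision_budget_ms = 150    # end-to-end decision budget, p99
scoring_budget_ms = 35      # ML scoring budget, p99

avg_qps = daily_txns / SECONDS_PER_DAY
peak_to_avg = peak_qps / avg_qps
scoring_share = scoring_budget_ms / decision_budget_ms

# Little's law (L = arrival_rate * time_in_system). Using the p99 budget
# in place of mean latency is a pessimistic assumption, so this is an
# upper bound on in-flight scoring requests at peak, not an estimate.
in_flight_upper_bound = peak_qps * (scoring_budget_ms / 1000)

print(f"average rate:           {avg_qps:,.0f} txns/sec")
print(f"peak / average:         {peak_to_avg:.2f}x")
print(f"scoring share of p99:   {scoring_share:.0%}")
print(f"in-flight scoring reqs: <= {in_flight_upper_bound:,.0f}")
```

Under these assumptions the average rate is roughly 37K txns/sec, so the stated peak is a little over 3x average, and ML scoring may consume just under a quarter of the end-to-end latency budget.
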
Task
Design an end-to-end ML system for real-time fraud detection on credit card transactions. Your design should address:
- How you would define the prediction target, business objective, and decision policy (approve / decline / review / step-up authentication)
- The serving architecture for low-latency scoring at 120K QPS, including online features, batch features, and fallback behavior
- A multi-stage decision pipeline, such as lightweight rules or retrieval for known bad entities, followed by ML ranking/scoring and optional re-decision logic
- Model choices for each stage and how you would handle delayed labels, class imbalance, concept drift, and cold-start merchants/cards/devices
- Offline evaluation, online rollout, threshold tuning, and monitoring for model quality, calibration, and operational health
- Key failure modes, including feature drift, training-serving skew, outages, adversarial adaptation, and false-positive spikes
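
To make the expected shape of an answer concrete, here is a minimal sketch of a multi-stage decision flow: a blocklist lookup for known bad entities, then an ML score, then a threshold-based policy, with a fallback when the scorer is unavailable. Everything named here (`Txn`, `decide`, `conservative_fallback`, the threshold values, and the ordering of step-up below review) is a hypothetical illustration, not a prescribed design.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"
    REVIEW = "review"
    DECLINE = "decline"


@dataclass(frozen=True)
class Txn:
    card_id: str
    merchant_id: str
    device_id: str
    amount: float


# Hypothetical thresholds; real values would be tuned per region or segment.
STEP_UP_THRESHOLD = 0.30
REVIEW_THRESHOLD = 0.60
DECLINE_THRESHOLD = 0.90


def conservative_fallback(txn: Txn) -> Decision:
    """Hypothetical rules-only policy used when no ML score is available."""
    return Decision.STEP_UP if txn.amount > 500.0 else Decision.APPROVE


def decide(
    txn: Txn,
    blocklist: set[str],
    score_fn: Callable[[Txn], Optional[float]],
    fallback_fn: Callable[[Txn], Decision] = conservative_fallback,
) -> Decision:
    # Stage 1: cheap lookup for known bad cards, merchants, or devices.
    if {txn.card_id, txn.merchant_id, txn.device_id} & blocklist:
        return Decision.DECLINE

    # Stage 2: ML score in [0, 1]; None models a timeout or outage.
    score = score_fn(txn)
    if score is None:
        return fallback_fn(txn)

    # Stage 3: map the score to an action, highest threshold first.
    if score >= DECLINE_THRESHOLD:
        return Decision.DECLINE
    if score >= REVIEW_THRESHOLD:
        return Decision.REVIEW
    if score >= STEP_UP_THRESHOLD:
        return Decision.STEP_UP
    return Decision.APPROVE
```

For example, `decide(Txn("c1", "m1", "d1", 20.0), {"d9"}, lambda t: 0.45)` falls in the step-up band, while a `score_fn` that returns `None` routes the same transaction through the fallback policy instead of failing the authorization.
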
Constraints
- False positives are expensive: unnecessary declines hurt customer trust and interchange revenue
- False negatives are also expensive: fraud losses and chargeback costs are material
- The system must support region-specific compliance requirements; some raw PII cannot be stored in the online feature store
- Features must be explainable enough to support analyst review and adverse-action workflows
- Fraud patterns shift quickly during attacks, so some signals must update within seconds to minutes
- The authorization path cannot depend on a GPU-only service or any single regional dependency
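
The first two constraints imply a cost trade-off that can be written down directly. As a sketch, assuming the model outputs a calibrated fraud probability and that each error type has a fixed per-transaction cost (both simplifications, and the cost figures below are invented for illustration), declining is cheaper in expectation once the probability exceeds `cost_fp / (cost_fp + cost_fn)`.

```python
def break_even_threshold(cost_fp: float, cost_fn: float) -> float:
    """Fraud probability above which declining has lower expected cost.

    Declining costs (1 - p) * cost_fp in expectation (a good customer is
    turned away); approving costs p * cost_fn (fraud goes through).
    Setting the two equal gives p = cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp < 0 or cost_fn < 0:
        raise ValueError("costs must be non-negative")
    total = cost_fp + cost_fn
    if total == 0:
        raise ValueError("at least one cost must be positive")
    return cost_fp / total


def expected_cost(p_fraud: float, decline: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of one action given a calibrated fraud probability."""
    return (1.0 - p_fraud) * cost_fp if decline else p_fraud * cost_fn


# Hypothetical costs: 5 units for a wrongly declined good transaction,
# 95 units for an approved fraudulent one.
threshold = break_even_threshold(cost_fp=5.0, cost_fn=95.0)
print(f"break-even fraud probability: {threshold:.2f}")
```

A real policy would likely vary these costs by transaction amount and customer segment, and would add intermediate actions such as step-up or review between the approve and decline regions.
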