Product Context
Design an ML system for Apple Card authorization fraud detection. The system scores each card-present or card-not-present transaction in real time and decides whether to approve, step up with additional verification, or decline, while minimizing false declines for legitimate customers.
Scale
| Signal | Value |
|---|---|
| Apple Card active users | 35M |
| Peak authorization QPS | 120K |
| Average QPS | 45K |
| Transactions per day | ~3.8B authorization events |
| Merchant population | 60M+ merchants globally |
| Device graph size | 500M+ linked entities (Apple devices, accounts, cards) |
| End-to-end decision latency budget | 200ms p99 |
| Feature freshness target | <5s for streaming counters |
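As a quick sanity check on the figures above, the following sketch converts the average QPS into a daily event volume and computes the peak-to-average ratio that capacity planning has to absorb. It uses only numbers stated in the table; the variable names are illustrative.

```python
# Back-of-envelope sizing using values copied from the Scale table.
AVG_QPS = 45_000
PEAK_QPS = 120_000
SECONDS_PER_DAY = 86_400

# Daily authorization volume implied by the average QPS.
events_per_day = AVG_QPS * SECONDS_PER_DAY

# Headroom factor the serving tier needs over the average load.
peak_to_avg = PEAK_QPS / AVG_QPS

print(f"{events_per_day:,} events/day")
print(f"peak/avg ratio: {peak_to_avg:.2f}")
```

The average rate works out to roughly 3.9B events per day, and the peak is about 2.7x the average, so serving and feature-store capacity should be sized against the peak rather than the mean.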
Task
- Clarify the product and risk requirements: what actions the system can take, acceptable fraud loss, and customer experience constraints.
- Size the system and propose an end-to-end architecture for training, feature computation, and online serving.
- Design a multi-stage decision pipeline (fast rules / candidate risk retrieval / ML ranking / policy layer) and justify model choices per stage.
- Define the online vs batch feature strategy, including how you would build an online/offline feature store and avoid training-serving skew.
- Propose offline and online evaluation, thresholding, and rollout strategy, including how to handle delayed labels and concept drift.
- Identify key failure modes, monitoring, and fallback behavior under partial outages or model degradation.
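To make the decision pipeline in the tasks above concrete, here is a minimal sketch of a policy layer that combines a fast-rule outcome with a model score and maps them to the three actions named in the product context. The thresholds, the `decide` function, and the rules-only fallback when no score is available are all assumptions for illustration, not part of the problem statement; a real design would tune thresholds against fraud loss and false-decline cost.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"
    DECLINE = "decline"


# Hypothetical thresholds, chosen only for illustration.
STEP_UP_THRESHOLD = 0.60
DECLINE_THRESHOLD = 0.95


def decide(rule_hit: bool, score: Optional[float]) -> Action:
    """Map a fast-rule result and an optional model score to an action."""
    # Stage 1: a hard rule hit (e.g. a blocked card) short-circuits the model.
    if rule_hit:
        return Action.DECLINE
    # Fallback (assumed): if the model timed out or is unavailable,
    # rely on rules only, so a transaction with no rule hit is approved.
    if score is None:
        return Action.APPROVE
    # Stage 2: threshold the model score.
    if score >= DECLINE_THRESHOLD:
        return Action.DECLINE
    if score >= STEP_UP_THRESHOLD:
        return Action.STEP_UP
    return Action.APPROVE


print(decide(rule_hit=False, score=0.72))
print(decide(rule_hit=False, score=None))
```

Keeping the policy layer separate from the model lets thresholds change without retraining, and it gives auditors one place to see why a given action was taken.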
Constraints
- The system must return a decision in under 200ms p99, including feature fetches and network overhead.
- Fraud labels are delayed and partially observed due to chargebacks, disputes, and manual investigations.
- The system must satisfy financial compliance, auditability, and explainability requirements for adverse actions.
- Customer friction is expensive: unnecessary declines or step-ups hurt trust and transaction conversion.
- Some features are only available in batch; others must be computed from streaming events in near real time.
- The design should support global traffic bursts (e.g., holiday shopping) and regional failover without materially increasing fraud loss.
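One way to picture the streaming features mentioned in the constraints is a per-card velocity counter over a sliding time window. The sketch below is a single-process, in-memory illustration only; `SlidingWindowCounter` is a hypothetical name, and it assumes events arrive in non-decreasing timestamp order. A production system would keep such counters in a streaming engine and a low-latency store to meet the freshness and latency targets.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # assumed to be appended in time order

    def add(self, ts: float) -> None:
        """Record one event at timestamp `ts` (seconds)."""
        self._timestamps.append(ts)

    def count(self, now: float) -> int:
        """Return the number of events in the half-open window (now - window, now]."""
        cutoff = now - self.window_seconds
        # Evict events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Example: transactions on one card at t = 0s, 30s, and 59s, with a 60s window.
counter = SlidingWindowCounter(window_seconds=60)
for t in (0.0, 30.0, 59.0):
    counter.add(t)
print(counter.count(now=59.0))
print(counter.count(now=100.0))
```

Computing the same window definition in the batch training pipeline as in the online path is one way to reduce training-serving skew for features like this.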