Product Context
ShopSphere is a global e-commerce marketplace. During checkout, every payment attempt must be scored in real time to decide whether to approve the transaction, require step-up authentication, or block it. Downtime and bad predictions directly cost revenue and increase fraud losses.
Scale
| Signal | Value |
|---|---|
| DAU | 45M shoppers |
| Peak checkout QPS | 28K payment attempts/sec |
| Average QPS | 9K/sec |
| Merchants | 1.2M |
| Active cards / payment instruments | 180M |
| Historical transactions for training | 9B over 24 months |
| End-to-end decision latency budget (p99) | 120ms |
| Availability target | 99.99% |
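
The figures in the table translate into a few back-of-envelope numbers that are useful when sizing the design. The sketch below derives the downtime budget implied by the availability target, the peak-to-average ratio, and the per-region capacity needed to survive the loss of one region. The region count of 3 and the function names are assumptions made for illustration; the prompt does not specify them.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600, ignoring leap years
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def downtime_budget_minutes(availability: float, period_minutes: int) -> float:
    """Minutes of allowed unavailability in a period for a given target."""
    return (1.0 - availability) * period_minutes


def peak_to_average(peak_qps: float, avg_qps: float) -> float:
    """How many times larger peak load is than average load."""
    return peak_qps / avg_qps


def per_region_capacity(peak_qps: int, regions: int, tolerated_failures: int = 1) -> int:
    """QPS each surviving region must absorb if `tolerated_failures` regions are lost.

    Assumes active-active regions with evenly spread traffic (an assumption,
    not something the prompt states).
    """
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(peak_qps / survivors)


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(0.9999, MINUTES_PER_YEAR):.2f} min/year")      # 52.56 min/year
    print(f"{downtime_budget_minutes(0.9999, MINUTES_PER_30_DAYS):.2f} min/30 days")  # 4.32 min/30 days
    print(f"{peak_to_average(28_000, 9_000):.2f}x peak-to-average")                 # 3.11x peak-to-average
    print(per_region_capacity(28_000, regions=3), "QPS per surviving region")       # 14000 QPS per surviving region
```

A budget of roughly 4 minutes per 30 days leaves little room for manual intervention, which is one reason automated failover and fallback decisioning tend to matter in a design like this.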
Task
Design a highly available ML system for checkout risk scoring. Your design should address both prediction quality and graceful degradation when dependencies fail.
- Clarify the product objective, decision actions, and error costs (false approve vs false decline).
- Propose an end-to-end architecture covering feature generation, candidate policy evaluation, ML scoring, and final decisioning.
- Define the online and offline paths: training data, feature store, model training cadence, deployment, and rollback.
- Explain how you would meet strict availability and latency requirements across regions while avoiding training-serving skew.
- Define offline and online evaluation, including business guardrails and segment-level monitoring.
- Identify major failure modes, especially dependency outages, feature drift, stale features, and model degradation during traffic spikes.
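
One possible shape for graceful degradation is a scoring call wrapped in a hard timeout, with a conservative rules-based fallback when the model is slow or fails. The sketch below is illustrative only: the thresholds (0.6 and 0.9), the 80ms timeout, the amount-based fallback rule, and all names (`score_with_fallback`, `rules_fallback`, `Decision`) are hypothetical and not taken from the prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional

APPROVE, STEP_UP, BLOCK = "approve", "step_up", "block"

# Hypothetical thresholds; in practice they would be tuned against error costs.
STEP_UP_THRESHOLD = 0.6
BLOCK_THRESHOLD = 0.9


@dataclass(frozen=True)
class Decision:
    action: str
    source: str              # "model" or "fallback_rules"
    score: Optional[float]   # None when the model did not answer


def rules_fallback(txn: dict) -> Decision:
    """Conservative rule used when the model is unavailable (assumed policy)."""
    if txn.get("amount", 0.0) >= 500.0:
        return Decision(STEP_UP, "fallback_rules", None)
    return Decision(APPROVE, "fallback_rules", None)


def action_for_score(score: float) -> str:
    """Map a fraud score in [0, 1] to one of the three decision actions."""
    if score >= BLOCK_THRESHOLD:
        return BLOCK
    if score >= STEP_UP_THRESHOLD:
        return STEP_UP
    return APPROVE


_pool = ThreadPoolExecutor(max_workers=8)


def score_with_fallback(
    txn: dict,
    model_score: Callable[[dict], float],
    timeout_s: float = 0.08,
) -> Decision:
    """Call the model with a deadline; degrade to rules on timeout or error."""
    future = _pool.submit(model_score, txn)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running keeps going
        return rules_fallback(txn)
    except Exception:
        return rules_fallback(txn)
    return Decision(action_for_score(score), "model", score)
```

Recording `source` on every decision keeps fallback traffic visible in monitoring and in the decision log, so degraded periods can be audited and excluded from model evaluation if needed.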
Constraints
- The service is in the critical path of checkout: if it is unavailable, carts are abandoned within seconds.
- Some labels are delayed: confirmed fraud may arrive days to weeks later via chargebacks.
- Regulatory and audit requirements mandate explainable decisions and immutable decision logs.
- Cross-border traffic creates feature sparsity and cold-start for new merchants, devices, and payment instruments.
- Cost matters: the online path must primarily run on CPU, with limited room for expensive deep models.
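
To make the decision-log constraint concrete, one common approach is an append-only log in which each entry carries a hash of the previous entry, so that any later modification is detectable. The sketch below is a minimal in-memory illustration under that assumption. It provides tamper evidence rather than true immutability, which would additionally require write-once storage. The record fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: str) -> str:
    """SHA-256 over the previous entry's hash concatenated with this payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_decision(log: list, record: dict) -> dict:
    """Append a decision record, chaining it to the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    # Canonical JSON so the same record always serializes to the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "entry_hash": _entry_hash(prev_hash, payload),
    }
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["entry_hash"]
    return True


if __name__ == "__main__":
    log = []
    append_decision(log, {"txn_id": "t1", "action": "approve", "score": 0.12,
                          "model_version": "v7", "top_reasons": ["known_device"]})
    append_decision(log, {"txn_id": "t2", "action": "block", "score": 0.97,
                          "model_version": "v7", "top_reasons": ["velocity_spike"]})
    print(verify_chain(log))  # True
    log[0]["payload"] = log[0]["payload"].replace("approve", "block")
    print(verify_chain(log))  # False
```

Storing the model version and the top contributing reasons alongside each action, as in the example records, is one way to keep each logged decision explainable after the fact.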