Product Context
ShopSphere is a large e-commerce marketplace. During peak events like Black Friday, the checkout flow is business-critical. The company wants an ML-driven service that predicts checkout risk in real time, so the platform can route users to safer payment, fraud, and fulfillment paths without hurting conversion.
Scale
| Signal | Value |
|---|---|
| DAU | 45M shoppers |
| Peak checkout starts | 180K/min |
| Peak QPS to risk service | 35K QPS |
| Payment methods | 120 globally |
| Merchants / sellers | 3.5M |
| Active SKUs | 220M |
| End-to-end checkout latency budget | 800ms p99 |
| ML service latency budget | 60ms p99 |
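
A quick sanity check on the figures above can help frame clarifying questions. The sketch below uses only the table values. The reading of the resulting ratio (that the risk service is called several times per checkout, or by flows other than checkout start) is an inference, not something the prompt states.

```python
# Figures taken directly from the Scale table.
peak_checkout_starts_per_min = 180_000
peak_risk_qps = 35_000
checkout_budget_ms = 800
ml_budget_ms = 60

# Convert checkout starts to a per-second rate so it is comparable to QPS.
checkout_starts_per_sec = peak_checkout_starts_per_min / 60

# Ratio of risk-service requests to checkout starts at peak.
calls_per_checkout = peak_risk_qps / checkout_starts_per_sec

# Fraction of the end-to-end p99 budget that the ML service may consume.
ml_share = ml_budget_ms / checkout_budget_ms

print(f"{checkout_starts_per_sec:.0f} checkout starts/s")
print(f"{calls_per_checkout:.1f} risk calls per checkout start")
print(f"{ml_share:.1%} of the checkout latency budget")
```

The roughly 12:1 ratio between risk QPS and checkout starts is worth confirming with the interviewer, since it changes how caching and per-session feature reuse would be designed.
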
Task
Design an end-to-end ML system that keeps checkout highly available during peak events while minimizing false declines and customer friction.
- Clarify the prediction target, decision points, and what actions the service can trigger during checkout.
- Propose the online and offline architecture, including feature computation, model training, serving, and fallback behavior under dependency failures.
- Design a multi-stage decision system if appropriate (for example: fast retrieval/rules triage → risk ranking → policy re-ranking / action selection).
- Define how you would evaluate the system offline and online, including business metrics and guardrails during Black Friday traffic spikes.
- Identify major failure modes such as feature drift, training-serving skew, hot keys, stale features, and cascading failures from downstream services.
- Explain capacity planning, regional failover, and how the system degrades gracefully when models or features are unavailable.
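
One possible shape for the multi-stage decision path with graceful degradation is sketched below. It is an illustration under stated assumptions, not a reference design: the action names (`approve`, `step_up`, `decline`), the request fields, all thresholds, and the stand-in scorers are hypothetical. Only the 60 ms budget comes from the Scale table.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ML_BUDGET_S = 0.060  # 60 ms p99 ML budget from the Scale table


def rules_triage(request):
    """Stage 1: cheap deterministic rules. Returns an action, or None to continue."""
    if request["on_blocklist"]:
        return "decline"
    if request["amount"] < 5.0:  # hypothetical low-value fast path
        return "approve"
    return None


def policy(score):
    """Stage 3: map a risk score to an action (thresholds are illustrative)."""
    if score < 0.2:
        return "approve"
    if score < 0.8:
        return "step_up"
    return "decline"


def fallback(request):
    """Rules-only path used when the model is slow or failing (hypothetical rule)."""
    return "approve" if request["amount"] < 200.0 else "step_up"


def decide(request, scorer, executor, budget_s=ML_BUDGET_S):
    """Return (action, source), where source is 'rules', 'model', or 'fallback'."""
    action = rules_triage(request)
    if action is not None:
        return action, "rules"
    future = executor.submit(scorer, request)  # Stage 2: model scoring
    try:
        score = future.result(timeout=budget_s)
    except Exception:  # timeout or scorer error: degrade, never block checkout
        future.cancel()
        return fallback(request), "fallback"
    return policy(score), "model"


def fast_scorer(request):
    """Stand-in for a healthy model call."""
    return 0.1


def slow_scorer(request):
    """Stand-in for a model call that exceeds the latency budget."""
    time.sleep(0.3)
    return 0.9
```

Returning the decision source alongside the action makes it straightforward to monitor the fallback rate as an online guardrail during traffic spikes.
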
Constraints
- The service sits on the critical path of checkout and must not become a single point of failure.
- Labels are delayed and partially observed: chargebacks may arrive weeks later, while payment authorization failures arrive immediately.
- Some features contain sensitive payment and user data; design for PCI/PII minimization and regional data residency.
- Peak-event traffic can reach 8-10x the normal baseline within minutes.
- A simpler fallback path must preserve checkout availability even if ML quality drops.
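
The peak multiplier and the single-point-of-failure constraint together suggest a back-of-the-envelope capacity estimate. The sketch below is one way to do it. It assumes the 35K QPS figure is the 8-10x peak, and the per-replica throughput, target utilization, and region count are hypothetical placeholders to be replaced with measured values.

```python
import math

PEAK_RISK_QPS = 35_000  # from the Scale table


def implied_baseline_qps(peak_qps, multiplier):
    """Baseline QPS if the stated peak is `multiplier` times normal traffic."""
    return peak_qps / multiplier


def replicas_needed(peak_qps, per_replica_qps, target_utilization, regions):
    """Size each region so that peak traffic is still served after losing one region.

    Returns (replicas_per_region, total_replicas).
    """
    surviving_regions = regions - 1
    per_region_qps = peak_qps / surviving_regions
    usable_qps_per_replica = per_replica_qps * target_utilization
    per_region = math.ceil(per_region_qps / usable_qps_per_replica)
    return per_region, per_region * regions


# Baseline range implied by the 8-10x constraint (assumption: 35K QPS is the peak).
baseline_low = implied_baseline_qps(PEAK_RISK_QPS, 10)
baseline_high = implied_baseline_qps(PEAK_RISK_QPS, 8)

# Hypothetical inputs: 500 QPS per replica, 60% target utilization, 3 regions.
per_region, total = replicas_needed(PEAK_RISK_QPS, 500, 0.6, 3)
print(f"baseline ~{baseline_low:.0f}-{baseline_high:.0f} QPS")
print(f"{per_region} replicas per region, {total} total")
```

Because traffic can ramp within minutes, an estimate like this argues for pre-provisioning ahead of the event rather than relying only on reactive autoscaling.
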