Product Context
ShopNow is a global e-commerce marketplace. Its homepage recommendation service ranks products in real time for signed-in users, and the business cannot tolerate significant downtime because recommendation outages directly reduce revenue.
Scale
| Signal | Value |
|---|---|
| DAU | 85M |
| Peak homepage recommendation QPS | 220K |
| Regions | 3 active regions: us-east, eu-west, ap-southeast |
| Active catalog | 120M products |
| New/updated items per day | 9M |
| Per-request latency budget (p99) | 180ms end-to-end |
| Availability target | 99.99% |
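
The table implies a few derived numbers that are useful when sizing the design. The sketch below works them out; it assumes peak traffic is split evenly across the three regions, which the brief does not state, and all constant names are illustrative.

```python
# Back-of-envelope numbers derived from the Scale table.
# ASSUMPTION: peak traffic is split evenly across the three regions.

PEAK_QPS = 220_000            # peak homepage recommendation QPS
REGIONS = 3                   # us-east, eu-west, ap-southeast
UPDATES_PER_DAY = 9_000_000   # new/updated items per day
AVAILABILITY = 0.9999         # availability target

MINUTES_PER_YEAR = 365 * 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60

# Allowed downtime per year at the availability target.
downtime_budget_min = (1 - AVAILABILITY) * MINUTES_PER_YEAR

# Per-region load in steady state and after losing one region.
per_region_qps = PEAK_QPS / REGIONS
failover_qps = PEAK_QPS / (REGIONS - 1)
headroom = failover_qps / per_region_qps

# Average catalog update rate that item features and indexes must absorb.
updates_per_sec = UPDATES_PER_DAY / SECONDS_PER_DAY

print(f"downtime budget: {downtime_budget_min:.2f} min/year")
print(f"steady-state QPS per region: {per_region_qps:,.0f}")
print(f"QPS per surviving region: {failover_qps:,.0f} ({headroom:.1f}x)")
print(f"catalog updates: {updates_per_sec:.0f} items/s on average")
```

Under these assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime per year, each surviving region must absorb about 110K QPS (1.5x its steady-state share) during a full regional outage, and the catalog changes at about 104 items per second on average.
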
Task
Design an end-to-end ML system that continues serving recommendations during a full regional outage, a partial dependency outage, or degraded model serving.
Address the following:
- Clarify the functional and non-functional requirements, including what “graceful degradation” means for this product.
- Propose a multi-stage recommendation architecture (retrieval → ranking → re-ranking) that supports multi-region active-active or active-passive failover.
- Explain how training, feature computation, model deployment, and feature stores should work across regions to avoid training-serving skew and stale features during failover.
- Define the serving architecture, traffic routing, replication strategy, and fallback behavior when a region, feature store, ANN index, or ranker becomes unavailable.
- Describe how you would evaluate the system offline and online, including reliability metrics, recommendation quality metrics, and failover drills.
- Identify key failure modes, how you would detect them quickly, and how the system should mitigate them automatically.
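
One possible way to make "graceful degradation" concrete is a ladder of serving tiers selected from dependency health. The sketch below is illustrative only: the tier names, the `Health` fields, and the priority order are assumptions, not part of the brief, and a real design may choose different tiers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical serving tiers, from best to most degraded."""
    FULL_PERSONALIZED = 1   # retrieval -> ranking -> re-ranking
    CACHED_FEATURES = 2     # rank with cached or default features
    RETRIEVAL_ONLY = 3      # serve candidates in retrieval order
    POPULAR_ITEMS = 4       # precomputed, non-personalized list


@dataclass(frozen=True)
class Health:
    """Health of the dependencies named in the task (True = available)."""
    feature_store: bool
    ann_index: bool
    ranker: bool


def choose_tier(h: Health) -> Tier:
    """Pick the best tier the currently healthy dependencies can support."""
    if not h.ann_index:
        # No candidate retrieval: fall back to a precomputed list.
        return Tier.POPULAR_ITEMS
    if not h.ranker:
        # Candidates exist but cannot be scored by the ranker.
        return Tier.RETRIEVAL_ONLY
    if not h.feature_store:
        # Ranker is up but fresh features are unavailable.
        return Tier.CACHED_FEATURES
    return Tier.FULL_PERSONALIZED


if __name__ == "__main__":
    print(choose_tier(Health(feature_store=True, ann_index=True, ranker=False)))
```

Keeping the ladder explicit makes each fallback testable in failover drills, since every tier can be forced by toggling one health flag.
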
Constraints
- p99 latency must remain under 180ms in steady state and under 250ms during failover.
- User features must be less than 5 minutes old at serving time; item features must be less than 15 minutes old.
- Cross-region data transfer is expensive, so not every feature can be synchronously replicated.
- Some user data is residency-constrained in the EU and cannot be freely copied to all regions.
- The system must support a degraded fallback experience even if personalized ranking is unavailable.
- Model rollouts must be safe: a bad model should not propagate globally without guardrails.
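
The numeric constraints above can be written down as machine-checkable limits, which helps when defining alerts and drill pass criteria. This is a minimal sketch: the function and constant names are hypothetical, and the residency rule is a simplifying assumption (residency-constrained EU data never leaves eu-west), since the brief only says such data cannot be freely copied.

```python
# Limits taken from the Constraints list.
P99_BUDGET_MS = {"steady": 180, "failover": 250}
MAX_FEATURE_AGE_S = {"user": 5 * 60, "item": 15 * 60}


def within_latency_budget(p99_ms: float, mode: str) -> bool:
    """True if observed p99 latency is under the budget for this mode."""
    return p99_ms < P99_BUDGET_MS[mode]


def is_fresh(kind: str, age_s: float) -> bool:
    """True if a 'user' or 'item' feature is younger than its limit."""
    return age_s < MAX_FEATURE_AGE_S[kind]


def may_replicate(home_region: str, dest_region: str,
                  residency_constrained: bool) -> bool:
    """Decide whether a feature may be copied to dest_region.

    ASSUMPTION: residency-constrained data homed in eu-west must stay
    in eu-west; everything else may be replicated anywhere.
    """
    if residency_constrained and home_region == "eu-west":
        return dest_region == "eu-west"
    return True


if __name__ == "__main__":
    print(within_latency_budget(210, "failover"))    # under the 250ms limit
    print(is_fresh("user", age_s=400))               # older than 5 minutes
    print(may_replicate("eu-west", "us-east", True)) # blocked by residency
```

A consequence of the residency assumption is that EU users routed to another region during an eu-west outage cannot be served with their constrained features there, so the degraded experience for that traffic has to rely on non-constrained signals.
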