Product Context
ShopSphere is a large e-commerce marketplace. You need to design a reliable, scalable ML-powered search and recommendation stack that returns relevant products when users search or browse, while continuing to serve traffic during failures or traffic spikes.
Scale
| Signal | Value |
|---|---|
| DAU | 35M |
| Peak search/browse QPS | 180K |
| Active product catalog | 120M SKUs |
| New or updated items/day | 4M |
| Average candidates before ranking | 20K per request |
| End-to-end p99 latency budget | 150ms |
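
As a starting point for sizing, the figures in the table can be combined into a few derived load numbers. The sketch below is only back-of-envelope arithmetic: it assumes updates are spread evenly over the day (real update traffic is likely bursty), and all variable names are illustrative.

```python
# Back-of-envelope load figures derived from the Scale table above.
PEAK_QPS = 180_000                 # peak search/browse requests per second
CANDIDATES_PER_REQUEST = 20_000    # average candidates before ranking
UPDATES_PER_DAY = 4_000_000        # new or updated items per day
CATALOG_SKUS = 120_000_000         # active product catalog size
SECONDS_PER_DAY = 24 * 60 * 60

# Candidate scorings per second if every candidate reached a scoring stage.
candidate_scores_per_sec = PEAK_QPS * CANDIDATES_PER_REQUEST

# Average index/feature update rate, assuming a uniform spread (an assumption).
avg_updates_per_sec = UPDATES_PER_DAY / SECONDS_PER_DAY

# Fraction of the catalog that changes each day.
daily_churn_fraction = UPDATES_PER_DAY / CATALOG_SKUS

print(f"candidate scorings/sec at peak: {candidate_scores_per_sec:,}")
print(f"average item updates/sec:       {avg_updates_per_sec:.1f}")
print(f"daily catalog churn:            {daily_churn_fraction:.1%}")
```

At 3.6 billion candidate scorings per second at peak, scoring every candidate with an expensive model is unlikely to be affordable, which is one reason to consider a cheap first-pass stage that narrows the candidate set before any heavier ranker.
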
Task
Design an end-to-end ML system that is both reliable and scalable for product retrieval and ranking. Address the following:
- Clarify the product requirements, reliability goals, and success metrics for search and browse ranking.
- Propose a multi-stage architecture for candidate retrieval, ranking, and re-ranking, including fallback behavior when components fail or exceed latency budgets.
- Choose models and features for each stage, and explain what should run online versus batch.
- Design the training and data pipeline, including labels, feature freshness, and how to avoid training-serving skew.
- Define offline and online evaluation, monitoring, alerting, and rollout strategy.
- Identify key failure modes at scale and how the system should detect and mitigate them.
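
One way to think about the fallback behavior asked for above is a per-stage latency budget with a cheaper substitute for each stage. The following is a minimal sketch under stated assumptions, not a reference design: the stage functions (`ann_retrieval`, `popularity_candidates`, `heavy_ranker`, `passthrough_ranker`), the SKU strings, and the 60 ms per-stage budgets are all hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)

def call_with_budget(stage, fallback, budget_s, *args):
    """Run stage(*args); if it times out or raises, return fallback(*args)."""
    future = _POOL.submit(stage, *args)
    try:
        return future.result(timeout=budget_s)
    except Exception:
        # Covers both a blown latency budget and an error inside the stage.
        future.cancel()  # best effort; an already-running call is not interrupted
        return fallback(*args)

# --- Hypothetical stages, for illustration only ---
def ann_retrieval(query):
    if query == "slow":          # simulate a retrieval backend that is too slow
        time.sleep(0.3)
    return ["sku_a", "sku_b", "sku_c"]

def popularity_candidates(query):
    return ["sku_pop_1", "sku_pop_2", "sku_pop_3"]

def heavy_ranker(candidates):
    return sorted(candidates, reverse=True)   # stand-in for a learned model

def passthrough_ranker(candidates):
    return list(candidates)                   # keep retrieval order as-is

def serve(query, retrieval_budget_s=0.06, ranking_budget_s=0.06):
    # Illustrative budgets chosen to sum to less than the 150 ms p99 target.
    candidates = call_with_budget(
        ann_retrieval, popularity_candidates, retrieval_budget_s, query)
    return call_with_budget(
        heavy_ranker, passthrough_ranker, ranking_budget_s, candidates)

print(serve("shoes"))  # normal path
print(serve("slow"))   # retrieval exceeds its budget, popularity fallback is used
```

A thread pool with a result timeout bounds how long the caller waits but does not stop the slow call itself, so a production design would also need request cancellation or load shedding on the backend side.
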
Constraints
- Product availability, price, and inventory can change within minutes, and results must not show stale values for them.
- 25% of traffic comes from anonymous or cold-start users with little history.
- Serving cost matters: GPU usage is allowed only in the final ranking stage, and only if clearly justified.
- The system must meet 99.95% availability and degrade gracefully to simpler retrieval or popularity-based results during outages.
- Some features (e.g., user demographics) cannot be used directly for ranking due to compliance and fairness requirements.
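
To make the 99.95% availability target concrete, it can be converted into an allowed-downtime budget. The sketch below assumes availability is measured as uptime over a fixed window (a 30-day month and a 365-day year); a request-based definition of availability would give a different budget.

```python
# Convert the availability target into an allowed-downtime budget.
AVAILABILITY_TARGET = 0.9995

MINUTES_PER_30_DAYS = 30 * 24 * 60     # 43,200 minutes
MINUTES_PER_365_DAYS = 365 * 24 * 60   # 525,600 minutes

unavailable_fraction = 1 - AVAILABILITY_TARGET

downtime_min_per_month = unavailable_fraction * MINUTES_PER_30_DAYS
downtime_min_per_year = unavailable_fraction * MINUTES_PER_365_DAYS

print(f"allowed downtime per 30 days:  {downtime_min_per_month:.1f} min")
print(f"allowed downtime per 365 days: {downtime_min_per_year:.1f} min "
      f"({downtime_min_per_year / 60:.2f} h)")
```

A budget of roughly 21.6 minutes per 30 days leaves little room for manual recovery, which suggests the degraded paths (simpler retrieval, popularity-based results) should engage automatically rather than wait for an operator.
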