## Product Context
ShopNow is a large e-commerce marketplace redesigning its product recommendation and ranking stack for home-page and search-result personalization. You are given a proposed ML training pipeline and serving architecture; your task is to identify where it will break when traffic, catalog size, and retraining frequency each grow by 10x.
## Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak recommendation/search QPS | 220K |
| Active catalog | 180M SKUs |
| New or updated items/day | 14M |
| User events/day | 9B impressions, clicks, carts, purchases |
| End-to-end p99 latency budget | 180ms |
| Current retraining cadence | daily full retrain + hourly feature refresh |
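A quick back-of-envelope pass over the table above helps anchor what "10x" means in concrete rates. This is illustrative arithmetic only; the per-request cost ceiling comes from the Constraints section below.

```python
# Back-of-envelope sanity check on the stated scale numbers.
EVENTS_PER_DAY = 9e9            # user events/day from the Scale table
PEAK_QPS = 220_000              # today's peak recommendation/search QPS
COST_PER_REQUEST = 0.0012       # USD budget per request (Constraints)
SECONDS_PER_DAY = 86_400

avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY   # ~104K events/s today
peak_qps_10x = PEAK_QPS * 10                            # 2.2M QPS at 10x
peak_spend_per_sec = PEAK_QPS * COST_PER_REQUEST        # ~$264/s at today's peak

print(f"avg event rate today: {avg_events_per_sec:,.0f}/s")
print(f"peak QPS at 10x:      {peak_qps_10x:,}")
print(f"peak serving spend:   ${peak_spend_per_sec:,.2f}/s")
```

At 10x, the event stream alone exceeds 1M events/s on average, which is the first hint that daily batch label ETL will struggle.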
Assume the current proposal uses batch ETL for labels, a shared offline/online feature store, embedding-based retrieval, a learned ranker, and a lightweight re-ranker for business rules.
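The assumed serving path can be sketched as a three-stage funnel. This is a minimal sketch under the stated assumptions; `ann_index`, `ranker`, `business_rules`, and the candidate counts are illustrative placeholders, not part of the proposal.

```python
from typing import Callable

def recommend(user_vec: list[float],
              ann_index,                 # embedding-based retrieval (e.g. an ANN index)
              ranker: Callable,          # learned ranking model, scores one candidate
              business_rules: Callable,  # lightweight re-ranker for business rules
              k_retrieve: int = 500,
              k_return: int = 20):
    """Three-stage path assumed by the proposal: retrieve -> rank -> re-rank."""
    candidates = ann_index.search(user_vec, k_retrieve)    # cheap, broad recall
    scored = sorted(candidates, key=ranker, reverse=True)  # expensive, precise scoring
    return business_rules(scored)[:k_return]               # stock, policy, diversity
```

The funnel shape matters at 10x: retrieval must absorb the catalog growth, while the ranker's per-request candidate count is the main cost lever.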
## Task
- Clarify the product goals, success metrics, and what “break at 10x scale” means across training, storage, and serving.
- Identify the likely bottlenecks in the offline training pipeline, feature computation, model deployment, and online inference path.
- Propose an end-to-end architecture that still works at 10x scale, including retrieval, ranking, re-ranking, and feedback logging.
- Explain what should remain batch, what should move to streaming or nearline, and how to avoid training-serving skew.
- Define an evaluation and monitoring plan covering offline metrics, online experiments, drift detection, and rollback criteria.
- Call out the top failure modes you expect during scale-up and how you would mitigate them.
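On the training-serving skew point above, one common pattern is to log the exact feature values used at inference time so the training pipeline joins labels against these logs rather than recomputing features offline. A minimal sketch, with all names (`feature_store`, `model`, the log schema) hypothetical:

```python
import json
import time

def score_and_log(request_id, user_id, item_ids, feature_store, model, log):
    """Score candidates and log the served features verbatim, so training
    joins labels to these logs instead of recomputing features offline
    (recomputation is a classic source of training-serving skew)."""
    feats = {item: feature_store.get(user_id, item) for item in item_ids}
    scores = {item: model.score(f) for item, f in feats.items()}
    log.append(json.dumps({"request_id": request_id,
                           "ts": time.time(),
                           "features": feats,
                           "scores": scores}))
    return sorted(item_ids, key=scores.__getitem__, reverse=True)
```

Logged features also give you the dataset for drift detection for free, at the cost of log volume, which the 30-day identifiable-retention constraint below bounds.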
## Constraints
- Fresh inventory and price changes must be reflected in recommendations within 10 minutes.
- User privacy policy limits raw event retention to 30 days in identifiable form.
- Serving cost must stay under $0.0012 per request at peak.
- The system must tolerate partial failures without returning empty recommendation sets.
- Candidate-generation and feature teams change feature definitions frequently, so schema evolution is a real risk.
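The partial-failure constraint implies a degraded-but-never-empty serving path. One way to satisfy it, sketched with hypothetical names (`personalized`, `popular_cache`), is to fall back to a cached popularity list whenever the personalized path fails or returns nothing:

```python
def recommend_with_fallback(user_id, personalized, popular_cache, k=20):
    """Never return an empty result set: on any failure or empty
    personalized response, serve a cached popularity list instead."""
    try:
        recs = personalized(user_id)
        if recs:
            return recs[:k]
    except Exception:
        pass  # in production: emit a metric here rather than swallow silently
    return popular_cache[:k]
```

The fallback cache should be refreshed within the same 10-minute freshness window so degraded responses do not surface out-of-stock or stale-priced items.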