
You are designing a personalized recommendation system for a large e-commerce marketplace. The system powers the home feed and product detail page recommendations, and it is expected to improve conversion, basket size, and repeat engagement. Users should see relevant products based on their behavior, context, and catalog changes, while new products must become discoverable quickly. The system must operate reliably in production and support rapid iteration by data science and engineering teams.
| Signal | Value |
|---|---|
| DAU | 18M |
| Peak recommendation QPS | 85K |
| Active product catalog | 120M SKUs |
| New or updated products/day | 3M |
| Per-request latency budget (p99) | 180ms |
| Average candidates scored/request | 5K |
How would you design this end-to-end machine learning system for production, including the data pipeline, model stages, serving architecture, evaluation approach, and how you would handle drift, skew, and operational failures at scale?