You are designing the recommendation stack for a social content feed where users browse posts, videos, and creators in a ranked home surface. A major product problem is cold start: many users arrive with little or no interaction history, and a large share of new content has no engagement labels yet. The business wants to improve early-session engagement without overfitting to only globally popular content. You need a system that can personalize quickly as signals arrive while still serving high-quality recommendations on the first few requests.
| Signal | Value |
|---|---|
| DAU | 180M |
| Peak feed request QPS | 900K |
| Active content catalog | 600M items |
| New items per day | 10M |
| New users per day | 4M |
| Per-request latency budget (p99) | 180ms |
How would you design this end-to-end recommendation and ranking system to handle cold-start users and cold-start items at this scale? Explain the architecture, model choices, online and offline components, evaluation plan, and how you would monitor and mitigate failure modes such as drift, skew, and poor exploration behavior.