You are designing the candidate retrieval layer for a personalized feed in a large consumer social app. On each refresh, the system must quickly narrow a massive content corpus to a few thousand candidates before downstream ranking and re-ranking. The goal is to improve relevance and freshness while keeping serving latency low enough for an interactive feed experience. You are considering a two-tower model for retrieval and need an end-to-end design that can operate reliably at large scale.

| Signal | Value |
|---|---|
| DAU | 180M |
| Peak feed refresh QPS | 900K |
| Active content catalog | 600M items |
| New items per day | 10M |
| Retrieval output per request | 5K candidates |
| End-to-end feed latency budget (p99) | 180ms |
| Retrieval stage latency budget (p99) | 25ms |
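
As a starting point, the figures in the table can be turned into rough sizing numbers. The sketch below is only back-of-envelope arithmetic: the embedding dimension (128) and the float32/int8 storage widths are assumptions chosen for illustration and are not given in the prompt, and the index estimate covers raw vectors only, ignoring any ANN graph or codebook overhead and replication.

```python
# Back-of-envelope sizing derived from the scale table.
PEAK_QPS = 900_000                 # peak feed refresh QPS
CATALOG_ITEMS = 600_000_000        # active content catalog
NEW_ITEMS_PER_DAY = 10_000_000     # new items per day
CANDIDATES_PER_REQUEST = 5_000     # retrieval output per request
FEED_P99_MS = 180                  # end-to-end feed latency budget (p99)
RETRIEVAL_P99_MS = 25              # retrieval stage latency budget (p99)

# Assumptions for illustration only (not stated in the prompt).
EMBEDDING_DIM = 128
BYTES_FLOAT32 = 4
BYTES_INT8 = 1
SECONDS_PER_DAY = 86_400


def sizing() -> dict:
    """Return rough capacity figures implied by the table."""
    raw_values = CATALOG_ITEMS * EMBEDDING_DIM  # scalar values across all item vectors
    return {
        # Candidate IDs handed to downstream ranking per second at peak.
        "candidates_per_sec": PEAK_QPS * CANDIDATES_PER_REQUEST,
        # Average rate at which new items must be embedded and indexed.
        "new_items_per_sec": NEW_ITEMS_PER_DAY / SECONDS_PER_DAY,
        # New items per day as a fraction of the active catalog.
        "new_items_fraction_per_day": NEW_ITEMS_PER_DAY / CATALOG_ITEMS,
        # Share of the end-to-end p99 budget allotted to retrieval.
        "retrieval_budget_share": RETRIEVAL_P99_MS / FEED_P99_MS,
        # Raw vector storage for one full copy of the item index.
        "index_gb_float32": raw_values * BYTES_FLOAT32 / 1e9,
        "index_gb_int8": raw_values * BYTES_INT8 / 1e9,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:,.4f}")
```

Under these assumptions, the raw float32 vectors for a single copy of the index would not fit comfortably on one typical host, which is one reason sharding and vector quantization are likely to come up in the design discussion.
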

How would you design this two-tower retrieval system end to end, including training, indexing, online serving, and integration with downstream ranking? What trade-offs would you consider around embedding design, freshness, latency, evaluation, and operational reliability at this scale?