You are designing the ML serving stack for an ad ranking system in a large social app. The system must score ad candidates for feed requests while balancing relevance, revenue, and user experience. Some predictions can be precomputed in batch, while others must be generated in real time from fresh user context. The business goal is to improve ad performance without violating tight latency budgets on the main feed.

| Signal | Value |
|---|---|
| DAU | 180M |
| Peak feed request QPS | 900K |
| Ad catalog | 25M active ads |
| Candidates scored per request | 2,000 retrieved -> 150 ranked -> 10 returned |
| p99 latency budget | 120 ms end-to-end |
| New/updated ads per day | 3M |
| User feature freshness target | < 5 minutes |
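
Before choosing what to serve from batch and what to serve online, it helps to turn the table into per-second volumes. The sketch below is only back-of-envelope arithmetic on the figures above. It assumes that every peak feed request runs the full 2,000 -> 150 -> 10 funnel and that ad updates arrive uniformly over the day; neither assumption is stated in the prompt, and the constant names are illustrative.

```python
# Back-of-envelope load figures derived from the signal table.
# Assumptions (not from the prompt): every peak request runs the full
# funnel, and ad updates are spread uniformly across 24 hours.

PEAK_QPS = 900_000                 # peak feed request QPS
RETRIEVED_PER_REQUEST = 2_000      # candidates retrieved per request
RANKED_PER_REQUEST = 150           # candidates passed to the ranker
AD_UPDATES_PER_DAY = 3_000_000     # new/updated ads per day
SECONDS_PER_DAY = 24 * 60 * 60

# Candidate scores the retrieval stage must produce per second at peak.
retrieval_scores_per_sec = PEAK_QPS * RETRIEVED_PER_REQUEST

# Candidate scores the ranking stage must produce per second at peak.
ranking_scores_per_sec = PEAK_QPS * RANKED_PER_REQUEST

# Average rate at which ad-side features or embeddings need refreshing.
ad_updates_per_sec = AD_UPDATES_PER_DAY / SECONDS_PER_DAY

print(f"retrieval scores/sec at peak: {retrieval_scores_per_sec:,}")
print(f"ranking scores/sec at peak:   {ranking_scores_per_sec:,}")
print(f"average ad updates/sec:       {ad_updates_per_sec:.1f}")
```

Under these assumptions the retrieval stage handles about 1.8 billion candidate scores per second and the ranker about 135 million, while ad updates average roughly 35 per second. The gap between those rates is one input to the batch-versus-online decision, not a conclusion in itself.
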

How would you design the end-to-end ML system, including what to serve from batch versus online inference, and how those choices affect retrieval, ranking, feature computation, evaluation, monitoring, and failure handling at this scale?