Product Context
Sparksoft wants to improve personalized recommendations on its consumer content surface, where users browse a large catalog of articles, videos, and templates. The recommendation feed is a primary engagement driver, and the system must adapt to both repeat users and cold-start traffic.
Scale
| Signal | Value |
|---|---|
| DAU | 35M |
| Peak recommendation QPS | 180K |
| Active content catalog | 120M items |
| New items/day | 1.8M |
| Average feed request size | 20 results |
| End-to-end p99 latency budget | 180 ms |
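A few derived rates help size the problem before designing anything. The sketch below is only back-of-envelope arithmetic on the figures in the table; the variable names are illustrative, and the assumption that new items arrive evenly across the day is a simplification, not something the table states.

```python
# Figures copied from the Scale table.
PEAK_QPS = 180_000
CATALOG_ITEMS = 120_000_000
NEW_ITEMS_PER_DAY = 1_800_000
RESULTS_PER_REQUEST = 20

SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate, assuming new items arrive evenly over the day (a simplification).
new_items_per_second = NEW_ITEMS_PER_DAY / SECONDS_PER_DAY

# Fraction of the active catalog that is brand new each day.
daily_new_item_fraction = NEW_ITEMS_PER_DAY / CATALOG_ITEMS

# Final results returned per second at peak; the number of candidates scored
# upstream depends on the design and is not given in the table.
peak_results_per_second = PEAK_QPS * RESULTS_PER_REQUEST

print(f"new items/sec (avg): {new_items_per_second:.1f}")
print(f"daily new-item fraction: {daily_new_item_fraction:.3%}")
print(f"results/sec at peak: {peak_results_per_second:,}")
```

Roughly 21 new items arrive per second on average, about 1.5% of the catalog is new each day, and the feed returns 3.6M results per second at peak.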
Task
Design an end-to-end ML system for Sparksoft personalized recommendations. Your design should address:
- How you would define functional and non-functional requirements, including freshness, personalization, and availability targets.
- A multi-stage recommendation architecture from candidate generation to ranking and re-ranking, with clear model choices for each stage.
- The offline training pipeline, feature computation, feature store design, and how logged feedback flows back into retraining.
- The online serving path, including latency budgeting, caching, fallbacks, and capacity planning at peak traffic.
- How you would evaluate the system offline and online, and how you would launch model changes safely.
- The top failure modes you expect in production, especially around feature drift, training-serving skew, cold start, and stale content.
Constraints
- User features should be fresh within 5 minutes; item features within 30 minutes.
- Sparksoft must support new-item cold start before engagement labels exist.
- Serving cost matters: average online inference cost should stay below $0.001 per request.
- The system must degrade gracefully if the ranker or feature store is unavailable.
- Assume privacy constraints prevent using raw PII directly in training or serving; only approved derived features may be used.
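The numeric constraints above can be written down as explicit thresholds so that a design can be checked against them. This is a minimal sketch, not a prescribed implementation: the `Constraints` class, `is_fresh`, and `within_cost_budget` are hypothetical names, and treating "fresh within N minutes" as "feature age at most N minutes" is an assumed interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    # Thresholds copied from the Constraints list; field names are hypothetical.
    user_feature_max_age_s: int = 5 * 60
    item_feature_max_age_s: int = 30 * 60
    max_avg_cost_per_request_usd: float = 0.001


DEFAULT = Constraints()


def is_fresh(kind: str, age_s: float, c: Constraints = DEFAULT) -> bool:
    """Return True if a feature of the given kind ('user' or 'item') is within its freshness bound."""
    if kind == "user":
        return age_s <= c.user_feature_max_age_s
    if kind == "item":
        return age_s <= c.item_feature_max_age_s
    raise ValueError(f"unknown feature kind: {kind!r}")


def within_cost_budget(total_cost_usd: float, num_requests: int, c: Constraints = DEFAULT) -> bool:
    """Return True if the average inference cost per request stays below the ceiling."""
    if num_requests <= 0:
        raise ValueError("num_requests must be positive")
    return total_cost_usd / num_requests < c.max_avg_cost_per_request_usd


# Example: user features 4 minutes old pass, 6 minutes old do not.
print(is_fresh("user", 4 * 60), is_fresh("user", 6 * 60))
# Example: $90 spent over 100,000 requests averages $0.0009 per request.
print(within_cost_budget(90.0, 100_000))
```

Checks like these could run in monitoring or in a pre-launch gate; where exactly they live is a design decision left to the answer.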