Product Context
ShopNow is a large e-commerce marketplace whose homepage, search results, and recommendation widgets are still powered by a legacy monolith. The company wants to break it into distributed microservices while introducing a modern ML ranking stack that can serve personalized product recommendations and search ranking reliably at scale.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak QPS (homepage + search + rec widgets) | 220K |
| Product catalog | 120M active SKUs |
| New/updated items per day | 8M |
| User events per day | 3.5B impressions/clicks/add-to-carts/orders |
| p99 latency budget | 180ms end-to-end |
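As a rough sanity check on these numbers, the daily totals can be converted into average per-second rates. The sketch below uses only figures from the table. The helper `per_second` and the `derived` dictionary are illustrative names, and it assumes load is spread evenly over the day, which real traffic is not, so peak rates will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures copied from the Scale table.
dau = 45_000_000
active_skus = 120_000_000
item_updates_per_day = 8_000_000
user_events_per_day = 3_500_000_000


def per_second(daily_count: int) -> float:
    """Average rate per second, assuming an even spread over the day."""
    return daily_count / SECONDS_PER_DAY


# Derived quantities (illustrative names, not part of the prompt).
derived = {
    "avg_events_per_sec": per_second(user_events_per_day),
    "avg_item_updates_per_sec": per_second(item_updates_per_day),
    "events_per_user_per_day": user_events_per_day / dau,
    "catalog_churn_fraction_per_day": item_updates_per_day / active_skus,
}

for name, value in derived.items():
    print(f"{name}: {value:,.2f}")
```

On average this works out to roughly 40K logged events per second, under 100 catalog updates per second, and about 7% of the catalog changing each day. These are useful starting points when sizing the logging pipeline and the index refresh path.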
Task
Design how you would evolve the legacy monolith into a distributed ML-driven architecture. Address the following:
- Clarify the product requirements, migration goals, and success metrics for both platform reliability and ranking quality.
- Propose a target architecture that decomposes the monolith into services for retrieval, ranking, feature serving, logging, model management, and fallback serving.
- Design the end-to-end ML system: offline training, online inference, candidate generation, ranking, and optional re-ranking/business rules.
- Explain how you would migrate incrementally from the monolith to microservices without breaking user experience, including shadow traffic, canaries, and rollback.
- Define offline and online evaluation, plus monitoring for feature drift, training-serving skew, and service-level failures.
- Identify key failure modes and the tradeoffs between latency, relevance, operational complexity, and cost.
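For the incremental-migration item, one common approach (an assumption here, not something the prompt prescribes) is sticky, hash-based canary routing. Each user is mapped deterministically to a bucket, and only buckets below a configurable threshold are sent to the new stack. The names `bucket_for`, `route`, the salt string, and the labels `"new_stack"` and `"monolith"` are all hypothetical.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01% per bucket (illustrative choice)


def bucket_for(user_id: str, salt: str = "ranking-migration") -> int:
    """Map a user to a stable bucket in [0, BUCKETS) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(user_id: str, canary_fraction: float) -> str:
    """Send a fixed, sticky fraction of users to the new stack.

    Setting canary_fraction to 0.0 acts as an immediate rollback:
    every user is routed back to the monolith.
    """
    threshold = round(canary_fraction * BUCKETS)
    return "new_stack" if bucket_for(user_id) < threshold else "monolith"


# Hypothetical usage: ramp from 1% to 5%, then roll back.
users = [f"user-{i}" for i in range(20_000)]
for fraction in (0.01, 0.05, 0.0):
    share = sum(route(u, fraction) == "new_stack" for u in users) / len(users)
    print(f"canary_fraction={fraction}: observed share on new stack = {share:.4f}")
```

Because the bucket depends only on the user ID and the salt, a user who enters the canary at 1% stays in it as the fraction grows, which keeps their experience consistent during the ramp. Shadow traffic can reuse the same bucketing: the monolith still serves the response while the new stack's output is only logged for comparison.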
Constraints
- The existing monolith returns results in ~140ms p99, so the new system cannot materially regress latency.
- Some features are only available in batch today; near-real-time user features must be added gradually.
- Search and recommendation surfaces must share core infrastructure, but ranking logic can differ by surface.
- Compliance requires deletion of user-level features within 30 days of account deletion.
- During migration, at least 99.95% availability must be maintained and every service must have a safe fallback path.
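The availability and latency constraints can be turned into concrete budgets, and the fallback requirement can be pictured as a deadline-bounded call. The following is a minimal sketch under stated assumptions: availability is measured over a 30-day window (the prompt does not name a window), and `call_with_fallback`, `slow_ranker`, and `popular_items` are hypothetical stand-ins for a real ranking service and a non-personalized fallback.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window (the window length is an assumption)."""
    return (1.0 - availability) * window_days * 24 * 60


def latency_headroom_ms(budget_ms: int, baseline_p99_ms: int) -> int:
    """Gap between the end-to-end p99 budget and the monolith's current p99."""
    return budget_ms - baseline_p99_ms


_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(primary, fallback, request, timeout_s: float):
    """Try the primary ranker within a deadline; on timeout or error, use the fallback."""
    future = _POOL.submit(primary, request)
    try:
        return future.result(timeout=timeout_s), "primary"
    except Exception:  # covers both a timeout and an exception raised by the primary
        # cancel() only prevents a call that has not started yet; a call that is
        # already running keeps its worker thread until it finishes.
        future.cancel()
        return fallback(request), "fallback"


def slow_ranker(request):
    """Hypothetical personalized ranker that misses its deadline."""
    time.sleep(0.3)
    return ["personalized-1", "personalized-2"]


def popular_items(request):
    """Hypothetical cheap fallback, e.g. a precomputed popularity list."""
    return ["popular-1", "popular-2"]


print(downtime_budget_minutes(0.9995))
print(latency_headroom_ms(180, 140))
print(call_with_fallback(slow_ranker, popular_items, {"user_id": "u1"}, timeout_s=0.05))
```

Under the 30-day assumption, 99.95% availability allows about 21.6 minutes of downtime per window, and the 180ms budget leaves 40ms of headroom over the monolith's ~140ms p99. A production service would more likely enforce deadlines at the RPC layer than with a thread pool, since a timed-out call here still occupies a worker until it returns.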