Product Context
Devalore Commerce is the company’s e-commerce platform, and the personalized recommendation modules on the home feed, product detail pages, and cart page are key drivers of conversion. Design an end-to-end recommendation system that helps shoppers discover relevant products while balancing relevance, freshness, and business constraints.
Scale
| Signal | Value |
|---|
| DAU | 18M |
| Peak recommendation QPS | 85K |
| Active catalog | 45M SKUs |
| New or updated SKUs/day | 1.2M |
| Avg recommendation slots/request | 20 |
| End-to-end p99 latency budget | 180ms |
Assume traffic is split across three major surfaces in Devalore Commerce: homepage recommendations, similar items on product detail pages, and cart cross-sell recommendations. Users generate implicit feedback such as impressions, clicks, add-to-cart, purchases, dwell time, and skips. Product metadata includes category, brand, price, discount, seller, inventory, and text/image embeddings computed offline.
Task
Design the recommendation system and explain the major tradeoffs. Address the following:
- Clarify the product goals, success metrics, and the differences across homepage, PDP, and cart recommendation surfaces.
- Propose a multi-stage architecture for candidate generation, ranking, and re-ranking, including how you would handle cold-start users and new products.
- Define the offline training pipeline, feature store strategy, label construction, and how you would avoid training-serving skew.
- Describe the online serving architecture, including latency budget allocation, caching, fallback behavior, and capacity planning at peak traffic.
- Define an evaluation plan covering offline metrics, online experiments, guardrails, and segment-level analysis.
- Identify likely failure modes at scale, including feature drift, stale inventory, popularity bias, and monitoring gaps.
Constraints
- Recommendations must exclude out-of-stock items and respect user-level blocked brands/sellers.
- Freshness matters: inventory, price, and promotions can change within minutes.
- Cost matters: the system should primarily run on CPU online; GPU use should be limited to offline training or a small high-value ranking tier.
- Compliance: do not use sensitive attributes directly for personalization.
- The system should degrade gracefully if personalization features are missing or delayed.