Product Context
ShopNow is a large e-commerce marketplace that uses machine learning to rank products on search and recommendation surfaces. You need to design the end-to-end data pipeline that powers training and serving for these models, from raw events to online predictions.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak QPS (ranking requests) | 180K |
| Active product catalog | 120M SKUs |
| New/updated products per day | 8M |
| User events per day | 9B (impression, click, cart, and purchase events) |
| End-to-end serving latency budget (p99) | 150ms |
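
As a rough sanity check on the figures above, the daily totals can be converted into average rates. This is a back-of-envelope sketch only: it assumes load is spread uniformly across the day, which real traffic is not, so peak rates will be higher than these averages. All variable names are illustrative.

```python
# Back-of-envelope rates derived from the Scale table.
# Assumption: uniform load over 24 hours (real traffic is peakier).
SECONDS_PER_DAY = 24 * 60 * 60

DAU = 45_000_000
EVENTS_PER_DAY = 9_000_000_000
CATALOG_SIZE = 120_000_000
CATALOG_UPDATES_PER_DAY = 8_000_000

# Average ingestion rate the event pipeline must sustain.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average rate of product metadata changes.
avg_catalog_updates_per_sec = CATALOG_UPDATES_PER_DAY / SECONDS_PER_DAY

# Average number of logged events per daily active user.
events_per_dau = EVENTS_PER_DAY / DAU

# Share of the catalog that is new or changed on a given day.
daily_catalog_churn = CATALOG_UPDATES_PER_DAY / CATALOG_SIZE

print(f"avg events/s:          {avg_events_per_sec:,.0f}")
print(f"avg catalog updates/s: {avg_catalog_updates_per_sec:,.1f}")
print(f"events per DAU:        {events_per_dau:,.0f}")
print(f"daily catalog churn:   {daily_catalog_churn:.1%}")
```

Under these assumptions the event stream averages roughly 104K events per second, each active user generates about 200 events per day, and about 6.7% of the catalog changes daily.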
Deliverables
- Define the data sources, schemas, and storage layers needed for both offline training and online inference.
- Design the full pipeline from event ingestion and feature computation to model training, validation, deployment, and feedback logging.
- Propose a serving architecture that supports multi-stage inference (candidate retrieval, ranking, optional re-ranking) under the latency budget.
- Explain how you would prevent training-serving skew, handle feature drift, and keep feature freshness within SLA.
- Define offline and online evaluation, monitoring, rollback, and backfill strategies.
- Identify major failure modes in the pipeline and how you would detect and mitigate them.
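
One common way to reduce training-serving skew, mentioned in the deliverables above, is to build training rows with point-in-time-correct feature lookups, so that each example only sees feature values that existed when the event happened. The sketch below is one possible illustration, not a required design. The function name `point_in_time_value` and the `(timestamp, value)` history layout are hypothetical, and it assumes each feature's history is already sorted by timestamp.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple


def point_in_time_value(
    history: List[Tuple[int, Any]],
    event_ts: int,
    default: Optional[Any] = None,
) -> Any:
    """Return the latest feature value recorded at or before event_ts.

    history is a list of (timestamp, value) pairs sorted by timestamp
    (an assumption of this sketch). Values written after event_ts are
    never returned, which avoids leaking future information into
    training examples.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first entry strictly later than event_ts.
    idx = bisect_right(timestamps, event_ts)
    if idx == 0:
        # No value existed yet at event time.
        return default
    return history[idx - 1][1]


# Hypothetical history of a user's "clicks in last 7 days" feature.
clicks_7d_history = [(100, 1), (200, 3), (300, 7)]
value_at_250 = point_in_time_value(clicks_7d_history, 250)
value_at_50 = point_in_time_value(clicks_7d_history, 50, default=0)
```

Whether a value written at exactly the event timestamp should be visible is a design choice. This sketch includes it; a stricter variant would use `bisect_left` to exclude it.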
Constraints
- User behavioral features must be available online within 2 minutes of an event.
- Product metadata updates can arrive out of order from multiple seller systems.
- Privacy policy requires deletion of user-level data within 30 days of a deletion request.
- The system must continue serving degraded but safe recommendations if the online feature store or model service is partially unavailable.
- Infra cost matters: GPU usage is allowed for training, but online ranking should primarily run on CPU unless clearly justified.
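
To make the out-of-order constraint on product metadata concrete, one possible approach is a version-guarded upsert: an update is applied only if it is newer than what is already stored. The sketch below assumes every update carries a per-SKU `version` that is comparable across seller systems (for example a source sequence number), which the problem statement does not guarantee. `ProductUpdate`, `ProductStore`, and the `price_cents` field are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProductUpdate:
    sku: str
    version: int      # assumed: increases with each change to the SKU
    price_cents: int  # stand-in for the real metadata payload


class ProductStore:
    """In-memory stand-in for a metadata store with version-guarded writes."""

    def __init__(self) -> None:
        self._latest: Dict[str, ProductUpdate] = {}

    def apply(self, update: ProductUpdate) -> bool:
        """Apply the update unless an equal or newer version is stored.

        Returns True if the update was applied, False if it was dropped
        as stale or duplicate. This makes the write idempotent and
        insensitive to arrival order.
        """
        current = self._latest.get(update.sku)
        if current is not None and current.version >= update.version:
            return False
        self._latest[update.sku] = update
        return True

    def get(self, sku: str) -> Optional[ProductUpdate]:
        return self._latest.get(sku)


store = ProductStore()
# Version 2 arrives before version 1; the late version 1 is ignored.
applied_v2 = store.apply(ProductUpdate("sku-1", version=2, price_cents=900))
applied_v1 = store.apply(ProductUpdate("sku-1", version=1, price_cents=1000))
final = store.get("sku-1")
```

In this example the late-arriving version 1 is dropped and the stored price stays at the version 2 value. If seller systems cannot provide comparable versions, a different conflict-resolution rule would be needed.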