Product Context
ShopNow is a large e-commerce marketplace whose search and recommendation surfaces use an ML ranking system to order products for shoppers. The team wants a robust post-deployment monitoring design that catches feature drift, model-quality regressions, and training-serving skew before they materially hurt revenue.
Scale
| Signal | Value |
|---|---|
| DAU | 35M |
| Peak ranking QPS | 180K requests/sec |
| Active product catalog | 120M SKUs |
| Candidate set per request | ~5K retrieved → 300 ranked → 40 re-ranked |
| End-to-end p99 latency budget | 120ms |
| New / updated SKUs per day | 9M |
| Daily impression events | ~4.5B |
Task
Design the end-to-end ML system and monitoring strategy for this ranking stack after deployment. Address the following:
- Clarify the product objective, prediction target, and what “model quality” means online versus offline.
- Propose a multi-stage architecture (retrieval → ranking → re-ranking) and explain which features are computed in batch, in near-real-time, and fully online.
- Design a monitoring framework for feature drift, label drift, calibration, delayed feedback, and training-serving skew across the pipeline (a minimal drift-check sketch follows this list).
- Define how you would evaluate the system offline and online, including alert thresholds, dashboards, and rollback criteria.
- Identify likely failure modes at this scale and explain how the system degrades safely when features or models are stale, missing, or unhealthy (a staleness-fallback sketch also follows this list).
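For the monitoring bullet, here is a minimal sketch of a per-feature Population Stability Index (PSI) check, assuming reference samples are snapshotted at training time and live values are sampled from serving logs; the bin count, window sizes, and `ALERT_THRESHOLD` are illustrative placeholders rather than tuned values.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time reference sample
    and a window of live serving values for one feature."""
    # Bin edges come from the reference distribution, so the comparison is
    # anchored to what the model was trained on.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(expected, edges)[0] / len(expected)
    live_frac = np.histogram(observed, edges)[0] / len(observed)
    # Clip away empty bins to avoid log(0) and division by zero.
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    live_frac = np.clip(live_frac, eps, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

# Conventional rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 alert.
ALERT_THRESHOLD = 0.25

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(0.0, 1.0, 50_000)   # training-time snapshot
    live = rng.normal(0.4, 1.0, 50_000)        # drifted serving window
    score = psi(reference, live)
    print(f"PSI={score:.3f} alert={score > ALERT_THRESHOLD}")
```

The 0.1 / 0.25 bands are only a starting point; at ~4.5B daily impressions, per-feature thresholds would need validation against historical false-alarm rates before they page anyone.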
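For the safe-degradation bullet, a sketch of per-feature staleness fallbacks; the `MAX_AGE_SECONDS` budgets and field names are assumptions chosen to line up with the 10-minute freshness constraint below.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FeatureRecord:
    value: float
    updated_at: float  # unix seconds of the last successful refresh

# Illustrative staleness budgets per feature: price/inventory track the
# business's 10-minute freshness requirement; slow aggregates get more slack.
MAX_AGE_SECONDS = {"price": 600, "inventory": 600, "ctr_7d": 86_400}

def resolve_feature(name: str, record: Optional[FeatureRecord],
                    default: float, now: Optional[float] = None) -> Tuple[float, bool]:
    """Return (value, degraded). Missing or stale features fall back to a safe
    default so the ranker degrades gracefully instead of failing the request."""
    now = time.time() if now is None else now
    if record is None or now - record.updated_at > MAX_AGE_SECONDS.get(name, 3600):
        # Increment a degradation counter here; a sustained degraded fraction
        # above some budget should trip the same alerting path as drift.
        return default, True
    return record.value, False
```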
Constraints
- p99 latency must stay under 120ms globally.
- The business requires inventory and price changes to be reflected in ranking within 10 minutes.
- Some labels are delayed: purchases can occur hours after the impression (see the label-join sketch after this list).
- Cost matters: the online ranker must run primarily on CPU, with a limited GPU budget reserved for offline training only.
- The system must support auditable monitoring for regulated categories where ranking changes can affect seller exposure.
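To make the delayed-label constraint concrete, a small sketch of an attribution-window label join with a watermark: an impression is only emitted as a training example once its window has fully closed, so late purchases cannot silently flip labels afterward. The 24-hour window and all names here are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical attribution window: a purchase within 24h of the impression
# counts as a positive; too short a window biases labels negative.
ATTRIBUTION_WINDOW = timedelta(hours=24)

def label_impressions(impressions, purchases, now):
    """Yield (impression_id, label) for impressions whose attribution window
    has closed. impressions: (id, user_id, sku, ts); purchases: (user_id, sku, ts)."""
    purchase_index = {}
    for user_id, sku, ts in purchases:
        purchase_index.setdefault((user_id, sku), []).append(ts)

    for imp_id, user_id, sku, ts in impressions:
        if now - ts < ATTRIBUTION_WINDOW:
            continue  # window still open: hold back rather than emit a false negative
        hits = purchase_index.get((user_id, sku), [])
        label = any(ts <= p_ts <= ts + ATTRIBUTION_WINDOW for p_ts in hits)
        yield imp_id, int(label)

# Impression older than the window, purchased 2h later -> positive label.
imps = [("i1", "u1", "sku9", datetime(2024, 5, 1, 10, 0))]
buys = [("u1", "sku9", datetime(2024, 5, 1, 12, 0))]
print(list(label_impressions(imps, buys, now=datetime(2024, 5, 3))))  # [('i1', 1)]
```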