Product Context
ShopNow is a large e-commerce marketplace. Its search and recommendation surfaces use a live ranking model to order products for shoppers, and ranking quality directly affects revenue, conversion, and user trust.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak ranking QPS | 180K |
| Active product catalog | 120M SKUs |
| New/updated items per day | 8M |
| Candidates per request | 5K retrieved → 300 ranked → 40 re-ranked |
| End-to-end p99 latency budget | 150ms |
| Model retrain cadence | Daily full retrain, hourly feature refresh |
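
To make the table concrete, the following back-of-envelope sketch estimates how many candidate rows the ranking stage produces at peak, and how many would reach the logging pipeline under request-level sampling. The QPS and candidate counts come from the table; the 1-in-100 sampling rate and the helper name `peak_rows_per_second` are illustrative assumptions, not part of the problem statement.

```python
# Figures taken from the Scale table.
PEAK_QPS = 180_000
RANKED_PER_REQUEST = 300
RERANKED_PER_REQUEST = 40


def peak_rows_per_second(rows_per_request: int, sample_one_in: int = 1) -> int:
    """Rows emitted per second at peak when one in `sample_one_in` requests is logged."""
    if sample_one_in < 1:
        raise ValueError("sample_one_in must be >= 1")
    return (PEAK_QPS // sample_one_in) * rows_per_request


# Unsampled: every ranked candidate of every request.
ranked_rows = peak_rows_per_second(RANKED_PER_REQUEST)
# Hypothetical 1-in-100 request sampling (an assumption for illustration).
sampled_ranked_rows = peak_rows_per_second(RANKED_PER_REQUEST, sample_one_in=100)
# Final re-ranked slate only, unsampled.
reranked_rows = peak_rows_per_second(RERANKED_PER_REQUEST)

print(ranked_rows, sampled_ranked_rows, reranked_rows)
```

At peak this gives 54,000,000 ranked rows per second unsampled, which suggests that full-fidelity feature logging at the ranking stage is unlikely to fit a lightweight budget and that sampling or logging only the re-ranked slate is worth considering.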
Task
Design an end-to-end system for monitoring and responding to drift in a live ranking model. Your design should cover both the ranking stack and the operational loop that detects issues before they materially hurt business metrics.
Address the following:
- Define the functional and non-functional requirements for drift monitoring in a multi-stage retrieval → ranking → re-ranking system.
- Propose the online and offline architecture, including logging, feature storage, model serving, and how drift signals are computed.
- Specify what kinds of drift you would monitor (feature drift, label drift, concept drift, training-serving skew, segment-specific regressions), how you would detect them, and what thresholds or alerting strategy you would use.
- Explain how the system should respond when drift is detected: rollback, traffic shifting, retraining, feature disabling, fallback ranking, or human review.
- Define an evaluation strategy that combines offline validation, online experiments, and ongoing production monitoring.
- Identify key failure modes and mitigations, especially around delayed labels, sparse segments, and false-positive alerts.
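
As one possible starting point for the feature-drift item above, the sketch below computes a Population Stability Index (PSI) between a reference sample (for example, training data) and a current serving sample over shared bin edges. PSI is one common choice among several (KL divergence and the Kolmogorov-Smirnov test are alternatives); the function name, the `eps` floor, and the bin edges are assumptions made for illustration. The frequently quoted cutoffs of roughly 0.1 (watch) and 0.25 (alert) are an industry rule of thumb, not something the problem statement specifies, and would need tuning per feature.

```python
import math
from bisect import bisect_right
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    `edges` are sorted interior bin boundaries shared by both samples;
    `eps` floors empty-bin fractions so the logarithm stays finite.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    # Each term is non-negative, so PSI is 0 only when the binned shapes match.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Toy data: a feature whose mass shifts from a 50/50 split to 10/90.
reference_sample = [0.1] * 50 + [0.9] * 50
serving_sample = [0.1] * 10 + [0.9] * 90
print(psi(reference_sample, serving_sample, edges=[0.5]))
```

Because this only needs per-bin counts, the serving path can emit sampled feature values asynchronously and the PSI itself can be computed in a batch or streaming job, which keeps it off the latency-critical path.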
Constraints
- Ranking service must stay within 150ms p99, so monitoring cannot add heavy synchronous overhead.
- Click and conversion labels are delayed and biased by position/exposure.
- Some high-value segments are small (e.g., luxury, new users, low-inventory categories), so aggregate metrics may hide regressions.
- Compliance requires feature lineage and auditable rollback decisions.
- Infra budget allows only lightweight online checks; most heavy analysis must run asynchronously.
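
The small-segment constraint interacts with false-positive alerts: a raw conversion rate computed from a few dozen impressions is too noisy to alert on. One hedged way to handle this is to alert only when a confidence interval for the segment's rate excludes the baseline. The sketch below uses the Wilson score interval; the function names, the 95% level (`z = 1.96`), and the example counts are illustrative assumptions. It also does not correct for position or exposure bias, which would need separate treatment.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def segment_regressed(successes: int, trials: int, baseline_rate: float,
                      z: float = 1.96) -> bool:
    """True only if the whole interval sits below the baseline rate."""
    _, upper = wilson_interval(successes, trials, z)
    return upper < baseline_rate


# Sparse segment: 2 conversions in 40 impressions vs. a 6% baseline.
print(segment_regressed(2, 40, baseline_rate=0.06))
# Large segment: 200 conversions in 10,000 impressions vs. a 3% baseline.
print(segment_regressed(200, 10_000, baseline_rate=0.03))
```

With the sparse segment the interval is wide enough to include the baseline, so no alert fires; with the large segment the same kind of shortfall is flagged. A design following this approach might widen the aggregation window for small segments until the interval is tight enough to be actionable.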