Product Context
ShopNow is a large e-commerce marketplace. Its homepage recommendations are a tier-1 ML service: if they are unavailable or degraded, revenue drops immediately and users see a worse shopping experience.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak homepage recommendation QPS | 220K |
| Active item catalog | 120M SKUs |
| New/updated items per day | 8M |
| Per-request latency budget (p99) | 180ms |
| Regions | 3 active regions |
| Availability target | 99.99% |
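To make the availability target concrete, it helps to convert 99.99% into a downtime budget. The sketch below is only arithmetic on the figure in the table; the 30-day and 365-day windows and the name `downtime_budget_minutes` are assumptions chosen for illustration, not part of the prompt.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, window_days: int) -> float:
    """Minutes of full unavailability allowed in the window at the given target."""
    return (1.0 - availability) * window_days * MINUTES_PER_DAY


# 99.99% is the availability target from the Scale table.
monthly_budget = downtime_budget_minutes(0.9999, 30)
yearly_budget = downtime_budget_minutes(0.9999, 365)

print(f"30-day budget:  {monthly_budget:.2f} min")   # 4.32 min
print(f"365-day budget: {yearly_budget:.2f} min")    # 52.56 min
```

Under these assumed windows, a single incident in which the homepage returned nothing for several minutes would consume the whole 30-day budget, which suggests that degraded-but-useful responses need to count toward availability.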
Task
Design the end-to-end ML system for homepage recommendations with high availability and disaster recovery as first-class requirements. Assume the service uses a multi-stage pipeline and powers the first screen users see when opening the app.
Address the following:
- Clarify the functional and non-functional requirements, including what “tier-1” means for this ML service.
- Propose a multi-stage architecture (retrieval → ranking → re-ranking) and explain how it remains available during partial failures, regional outages, stale features, or model-serving incidents.
- Design the offline and online data/feature pipelines, including how you avoid training-serving skew and recover safely after data pipeline failures.
- Explain your serving architecture: active-active vs active-passive, cross-region failover, fallback paths, caching, capacity planning, and how you meet the latency SLO during failover.
- Define offline and online evaluation, plus operational monitoring for availability, model quality, drift, and disaster-recovery readiness.
- Identify the top failure modes and mitigations, including feature drift, corrupted model artifacts, feature store outages, ANN index staleness, and full-region loss.
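One possible shape for the fallback paths mentioned above is an ordered ladder of recommendation sources, tried from most to least personalized. This is a minimal sketch under stated assumptions: the tier names, the `recommend_with_fallback` helper, and the SKU strings are hypothetical, and a real design would also enforce per-tier timeouts against the latency budget.

```python
from typing import Callable, List, Sequence, Tuple

# A tier is a (name, fetch function) pair; fetch takes a user id and returns SKUs.
Tier = Tuple[str, Callable[[str], List[str]]]


def recommend_with_fallback(
    user_id: str, tiers: Sequence[Tier], static_default: List[str]
) -> Tuple[str, List[str]]:
    """Return (source name, items) from the first tier that yields results."""
    for name, fetch in tiers:
        try:
            items = fetch(user_id)
        except Exception:
            # Any failure (timeout, store outage, bad model) drops to the next tier.
            continue
        if items:
            return name, items
    # Last resort: a precomputed list that needs no online dependency.
    return "static_default", list(static_default)


def personalized(user_id: str) -> List[str]:
    # Hypothetical stand-in for the full retrieval -> ranking -> re-ranking path.
    raise TimeoutError("ranking service unavailable")


def regional_popular(user_id: str) -> List[str]:
    # Hypothetical stand-in for a cached, non-personalized popularity list.
    return ["sku-101", "sku-102"]


source, items = recommend_with_fallback(
    "user-1",
    [("personalized", personalized), ("regional_popular", regional_popular)],
    static_default=["sku-1", "sku-2"],
)
print(source, items)  # regional_popular ['sku-101', 'sku-102']
```

Returning the source name alongside the items lets monitoring track how often each degraded tier is served, which is one signal for the model-quality and availability dashboards.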
Constraints
- Homepage must always return something useful, even if personalized ranking is unavailable.
- User features should be less than 5 minutes old; item features should be less than 15 minutes old.
- Compliance requires user data to stay within its home geography; cross-region replication must respect this.
- Infra cost matters: you cannot simply run every stage at 3x peak in all regions.
- Recovery objectives: RTO < 10 minutes for a regional outage, RPO < 5 minutes for interaction logs.
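The numeric constraints above can be turned into simple checks. The sketch below assumes that traffic from a lost region can be spread evenly across the surviving regions, which the data-residency constraint may not fully allow, and that freshness is measured from a feature's last update timestamp. The names `is_fresh` and `required_region_qps` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Freshness limits taken from the constraints above.
MAX_AGE = {"user": timedelta(minutes=5), "item": timedelta(minutes=15)}


def is_fresh(kind: str, updated_at: datetime, now: datetime) -> bool:
    """True if a feature of the given kind is within its freshness limit."""
    return now - updated_at < MAX_AGE[kind]


def required_region_qps(peak_qps: float, regions: int, tolerated_losses: int = 1) -> float:
    """QPS each surviving region must absorb if `tolerated_losses` regions fail.

    Assumes even redistribution of traffic across survivors.
    """
    survivors = regions - tolerated_losses
    if survivors <= 0:
        raise ValueError("at least one region must survive")
    return peak_qps / survivors


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh("user", now - timedelta(minutes=6), now))  # False
print(is_fresh("item", now - timedelta(minutes=6), now))  # True

# 220K peak QPS and 3 regions come from the Scale table.
per_region = required_region_qps(220_000, 3)
total_factor = per_region * 3 / 220_000
print(per_region, total_factor)  # 110000.0 1.5
```

Under the even-redistribution assumption, surviving a single-region loss needs each region sized for 110K QPS, or about 1.5x global peak in aggregate. Stages with cheap fallbacks could plausibly be provisioned below that figure, since the cost constraint rules out uniform over-provisioning.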