Product Context
StyleSnap is a shopping app where users search using text, images, or both (for example, uploading a photo of shoes and adding “similar but cheaper”). Design the end-to-end multimodal retrieval and ranking system that returns relevant products from a large catalog.
Scale
| Signal | Value |
|---|---|
| DAU | 35M |
| Peak search QPS | 180K |
| Product catalog | 120M active SKUs |
| New/updated items per day | 4M |
| Queries with image input | 22% |
| Per-request latency budget (p99) | 250ms end-to-end |
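A quick back-of-envelope pass over these numbers is useful when sizing the image-encoding tier and the index-refresh pipeline (variable names are mine, not part of the prompt):

```python
# Back-of-envelope sizing from the Scale table above.
peak_qps = 180_000
image_query_frac = 0.22

# Queries per second that need an image embedding at peak.
image_query_qps = peak_qps * image_query_frac        # 39,600 qps

# Average catalog update rate needed to stay within the
# 15-minute freshness constraint (bursts will be higher).
daily_updates = 4_000_000
avg_update_rate = daily_updates / 86_400             # ~46 items/s

print(image_query_qps, round(avg_update_rate, 1))
```

Even the average update rate implies a streaming (not nightly batch) path into the index, and ~40K image-bearing queries per second at peak rules out running a large encoder synchronously per request.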
Task
- Clarify the product requirements and success metrics for multimodal search.
- Propose a multi-stage architecture for retrieval, ranking, and re-ranking at this scale.
- Choose model families for each stage and explain how text, image, and metadata signals are combined.
- Design the offline and online data pipelines, including feature storage, training cadence, and index refresh.
- Define offline evaluation, online experimentation, and monitoring for drift, skew, and quality regressions.
- Identify key failure modes, fallback behavior, and cost/latency tradeoffs.
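One way to sketch the first retrieval stage the tasks above ask for is late fusion of text and image query embeddings followed by nearest-neighbor search. This is a minimal illustration with brute-force cosine similarity; all function names and the fusion weight `alpha` are assumptions, and a production system would use an ANN index rather than a full scan:

```python
import numpy as np

def fuse_query_embedding(text_emb, image_emb, alpha=0.5):
    """Late-fuse text and image query embeddings (hypothetical scheme).

    One code path covers text-only, image-only, and text+image queries:
    missing modalities simply contribute nothing to the sum.
    """
    parts = []
    if text_emb is not None:
        parts.append(alpha * np.asarray(text_emb, dtype=float))
    if image_emb is not None:
        parts.append((1.0 - alpha) * np.asarray(image_emb, dtype=float))
    if not parts:
        raise ValueError("query must include text and/or image")
    fused = np.sum(parts, axis=0)
    return fused / np.linalg.norm(fused)

def retrieve_top_k(query_emb, catalog_embs, k=3):
    """Brute-force cosine retrieval; stands in for an ANN index lookup."""
    norms = np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    scores = (catalog_embs / norms) @ query_emb
    top = np.argsort(-scores)[:k]
    return list(top), scores[top]
```

The retrieved candidates would then flow into the heavier ranking and policy-aware re-ranking stages, which can afford per-item features the retrieval stage cannot.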
Constraints
- The system must support text-only, image-only, and text+image queries in one API.
- Image encoder models larger than 1 GB cannot be run synchronously per request, for cost reasons.
- Newly added products should become searchable within 15 minutes.
- The marketplace has strict policy filters: blocked brands, unsafe content, and region-specific compliance rules must be enforced before final ranking.
- Mobile clients are sensitive to tail latency; p99 above 250ms causes measurable search abandonment.
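The single-API constraint above might be captured by a request schema along these lines. This is a sketch only; the field names (`image_ref`, `region`) are assumptions, not a real StyleSnap API, and the image is passed as a handle to a pre-uploaded asset rather than raw bytes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchRequest:
    text: Optional[str] = None       # e.g. "similar but cheaper"
    image_ref: Optional[str] = None  # handle to a pre-uploaded image
    region: str = "US"               # drives region-specific compliance filters

    def __post_init__(self):
        # Enforce the one-API contract: at least one modality is required.
        if not self.text and not self.image_ref:
            raise ValueError("request must include text, an image, or both")

def query_mode(req: SearchRequest) -> str:
    """Classify the request for routing/metrics purposes."""
    if req.text and req.image_ref:
        return "text+image"
    return "text" if req.text else "image"
```

Validating the modality mix at the API boundary keeps the downstream retrieval stages free of per-modality special cases, and carrying `region` on every request lets policy filters run before final ranking as required.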