Design Reliable Product Search Ranking

Product Context

ShopSphere is a large e-commerce marketplace. You need to design a reliable, scalable ML-powered search and recommendation stack that returns relevant products when users search or browse, while continuing to serve traffic during failures or traffic spikes.

Scale

Signal	Value
DAU	35M
Peak search/browse QPS	180K
Active product catalog	120M SKUs
New or updated items/day	4M
Average candidates before ranking	20K per request
End-to-end p99 latency budget	150ms
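
The Scale figures imply a few derived load numbers that are useful to keep in view. The following is a back-of-envelope sketch using only the values in the table above; the variable names are illustrative, and the per-second update rate is a daily average that assumes updates are spread evenly, which real traffic will not be.

```python
# Back-of-envelope load figures derived only from the Scale table.
PEAK_QPS = 180_000                  # peak search/browse requests per second
CANDIDATES_PER_REQUEST = 20_000     # average candidates before ranking
ITEM_UPDATES_PER_DAY = 4_000_000    # new or updated items per day
CATALOG_SKUS = 120_000_000          # active product catalog

SECONDS_PER_DAY = 24 * 60 * 60

# Candidate-level scoring operations per second at peak, if every
# candidate were scored by some stage.
candidate_scores_per_sec = PEAK_QPS * CANDIDATES_PER_REQUEST

# Average item update rate (assumes an even spread across the day).
avg_item_updates_per_sec = ITEM_UPDATES_PER_DAY / SECONDS_PER_DAY

# Fraction of the catalog that is new or changed each day.
daily_catalog_churn = ITEM_UPDATES_PER_DAY / CATALOG_SKUS

print(f"candidate scores/sec at peak: {candidate_scores_per_sec:,}")
print(f"avg item updates/sec:         {avg_item_updates_per_sec:.1f}")
print(f"daily catalog churn:          {daily_catalog_churn:.1%}")
```

At roughly 3.6 billion candidate scores per second at peak, the numbers suggest why a cheap first stage has to cut the candidate set sharply before any expensive model runs.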

Task

Design an end-to-end ML system that is both reliable and scalable for product retrieval and ranking. Address the following:

Clarify the product requirements, reliability goals, and success metrics for search and browse ranking.
Propose a multi-stage architecture for candidate retrieval, ranking, and re-ranking, including fallback behavior when components fail or exceed latency budgets.
Choose models and features for each stage, and explain what should run online versus batch.
Design the training and data pipeline, including labels, feature freshness, and how to avoid training-serving skew.
Define offline and online evaluation, monitoring, alerting, and rollout strategy.
Identify key failure modes at scale and how the system should detect and mitigate them.
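
The fallback requirement in the task list can be pictured as a chain of tiers, each with its own time limit, ending in a cheap floor that is assumed to always answer. This is a minimal sketch of one possible shape, not a prescribed design: the function names, tier names, and the 50 ms per-tier timeouts are hypothetical, and a production system would more likely enforce deadlines in the RPC layer than with a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool (hypothetical size) so a timed-out call does not block the caller.
_POOL = ThreadPoolExecutor(max_workers=8)


def serve_with_fallback(query, tiers, floor):
    """Try each (name, fn, timeout_s) tier in order.

    Returns (tier_name, results) from the first tier that finishes within its
    timeout without raising; otherwise falls back to floor(query).
    """
    for name, fn, timeout_s in tiers:
        future = _POOL.submit(fn, query)
        try:
            return name, future.result(timeout=timeout_s)
        except FutureTimeout:
            future.cancel()  # best effort; an already-running call is not interrupted
        except Exception:
            continue  # a failed tier falls through to the next one
    return "popularity", floor(query)


# Hypothetical stand-ins for real components.
def slow_neural_ranker(query):
    time.sleep(0.5)  # simulates a ranker that blows its latency budget
    return ["sku-neural"]


def broken_lexical_retrieval(query):
    raise RuntimeError("index shard unavailable")


def popularity_floor(query):
    return ["sku-top-1", "sku-top-2"]  # assumed precomputed and always available


tier, results = serve_with_fallback(
    "running shoes",
    [("neural", slow_neural_ranker, 0.05), ("lexical", broken_lexical_retrieval, 0.05)],
    popularity_floor,
)
print(tier, results)
```

In this example the first tier times out and the second raises, so the request is answered by the popularity floor. The two illustrative timeouts sum to 100 ms, which leaves headroom inside the 150 ms p99 budget.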

Constraints

Product availability, price, and inventory can change within minutes and must not be stale in results.
25% of traffic comes from anonymous or cold-start users with little history.
Serving cost matters: GPU usage is allowed only in the final ranking stage if clearly justified.
The system must meet 99.95% availability and degrade gracefully to simpler retrieval or popularity-based results during outages.
Some features (e.g., user demographics) cannot be used directly for ranking due to compliance and fairness requirements.
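
Two of the constraints translate directly into numbers. The sketch below works them out under stated assumptions: availability is treated as time-based uptime over a 30-day or 365-day window (the source does not specify the measurement window), and cold-start traffic is assumed to hold its 25% share at peak.

```python
AVAILABILITY_TARGET = 0.9995   # 99.95% availability from the Constraints
COLD_START_SHARE = 0.25        # anonymous or cold-start share of traffic
PEAK_QPS = 180_000             # from the Scale table

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(days, target=AVAILABILITY_TARGET):
    """Downtime budget in minutes for a window of `days` at the given target."""
    return (1 - target) * days * MINUTES_PER_DAY


monthly_budget_min = allowed_downtime_minutes(30)
yearly_budget_min = allowed_downtime_minutes(365)

# Peak requests per second that arrive with little or no user history
# (assumes the 25% share also holds at peak).
cold_start_peak_qps = COLD_START_SHARE * PEAK_QPS

print(f"downtime budget per 30 days:  {monthly_budget_min:.1f} min")
print(f"downtime budget per 365 days: {yearly_budget_min / 60:.2f} h")
print(f"cold-start QPS at peak:       {cold_start_peak_qps:,.0f}")
```

Under these assumptions the budget is about 21.6 minutes of downtime per 30 days, and around 45,000 requests per second at peak must be served without relying on user history.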

