## Product Context
ShopNow is a large e-commerce marketplace redesigning its product recommendation and ranking stack for home-page and search-result personalization. You are given a proposed ML training pipeline and serving architecture; your task is to identify where it will break when traffic, catalog size, and retraining frequency each grow by 10x.
## Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak recommendation/search QPS | 220K |
| Active catalog | 180M SKUs |
| New or updated items/day | 14M |
| User events/day | 9B impressions, clicks, carts, purchases |
| End-to-end p99 latency budget | 180ms |
| Current retraining cadence | daily full retrain + hourly feature refresh |
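A quick back-of-envelope pass over the table above helps anchor what "10x" means in concrete rates. This is illustrative arithmetic only; the per-request cost ceiling comes from the Constraints section below.

```python
# Back-of-envelope sanity check on the stated scale numbers.
EVENTS_PER_DAY = 9e9            # user events/day from the Scale table
PEAK_QPS = 220_000              # today's peak recommendation/search QPS
COST_PER_REQUEST = 0.0012       # USD budget per request (Constraints)
SECONDS_PER_DAY = 86_400

avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY   # ~104K events/s today
peak_qps_10x = PEAK_QPS * 10                            # 2.2M QPS at 10x
peak_spend_per_sec = PEAK_QPS * COST_PER_REQUEST        # ~$264/s at today's peak

print(f"avg event rate today: {avg_events_per_sec:,.0f}/s")
print(f"peak QPS at 10x:      {peak_qps_10x:,}")
print(f"peak serving spend:   ${peak_spend_per_sec:,.2f}/s")
```

At 10x, the event stream alone exceeds 1M events/s on average, which is the first hint that daily batch label ETL will struggle.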
Assume the current proposal uses batch ETL for labels, a shared offline/online feature store, embedding-based retrieval, a learned ranker, and a lightweight re-ranker for business rules.
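The assumed serving path can be sketched as a three-stage funnel. This is a minimal sketch under the stated assumptions; `ann_index`, `ranker`, `business_rules`, and the candidate counts are illustrative placeholders, not part of the proposal.

```python
from typing import Callable

def recommend(user_vec: list[float],
              ann_index,                 # embedding-based retrieval (e.g. an ANN index)
              ranker: Callable,          # learned ranking model, scores one candidate
              business_rules: Callable,  # lightweight re-ranker for business rules
              k_retrieve: int = 500,
              k_return: int = 20):
    """Three-stage path assumed by the proposal: retrieve -> rank -> re-rank."""
    candidates = ann_index.search(user_vec, k_retrieve)    # cheap, broad recall
    scored = sorted(candidates, key=ranker, reverse=True)  # expensive, precise scoring
    return business_rules(scored)[:k_return]               # stock, policy, diversity
```

The funnel shape matters at 10x: retrieval must absorb the catalog growth, while the ranker's per-request candidate count is the main cost lever.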
## Task
- Clarify the product goals, success metrics, and what “break at 10x scale” means across training, storage, and serving.
- Identify the likely bottlenecks in the offline training pipeline, feature computation, model deployment, and online inference path.
- Propose an end-to-end architecture that still works at 10x scale, including retrieval, ranking, re-ranking, and feedback logging.
- Explain what should remain batch, what should move to streaming or nearline, and how to avoid training-serving skew.
- Define an evaluation and monitoring plan covering offline metrics, online experiments, drift detection, and rollback criteria.
- Call out the top failure modes you expect during scale-up and how you would mitigate them.
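On the training-serving skew point above, one common pattern is to log the exact feature values used at inference time so the training pipeline joins labels against these logs rather than recomputing features offline. A minimal sketch, with all names (`feature_store`, `model`, the log schema) hypothetical:

```python
import json
import time

def score_and_log(request_id, user_id, item_ids, feature_store, model, log):
    """Score candidates and log the served features verbatim, so training
    joins labels to these logs instead of recomputing features offline
    (recomputation is a classic source of training-serving skew)."""
    feats = {item: feature_store.get(user_id, item) for item in item_ids}
    scores = {item: model.score(f) for item, f in feats.items()}
    log.append(json.dumps({"request_id": request_id,
                           "ts": time.time(),
                           "features": feats,
                           "scores": scores}))
    return sorted(item_ids, key=scores.__getitem__, reverse=True)
```

Logged features also give you the dataset for drift detection for free, at the cost of log volume, which the 30-day identifiable-retention constraint below bounds.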
## Constraints
- Fresh inventory and price changes must be reflected in recommendations within 10 minutes.
- User privacy policy limits raw event retention to 30 days in identifiable form.
- Serving cost must stay under $0.0012 per request at peak.
- The system must tolerate partial failures without returning empty recommendation sets.
- Candidate-generation and feature teams change feature definitions frequently, so schema evolution is a real risk.
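The partial-failure constraint implies a degraded-but-never-empty serving path. One way to satisfy it, sketched with hypothetical names (`personalized`, `popular_cache`), is to fall back to a cached popularity list whenever the personalized path fails or returns nothing:

```python
def recommend_with_fallback(user_id, personalized, popular_cache, k=20):
    """Never return an empty result set: on any failure or empty
    personalized response, serve a cached popularity list instead."""
    try:
        recs = personalized(user_id)
        if recs:
            return recs[:k]
    except Exception:
        pass  # in production: emit a metric here rather than swallow silently
    return popular_cache[:k]
```

The fallback cache should be refreshed within the same 10-minute freshness window so degraded responses do not surface out-of-stock or stale-priced items.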