Design ML Data Pipeline

Product Context

ShopNow is a large e-commerce marketplace that uses machine learning to rank products on search and recommendation surfaces. You need to design the end-to-end data pipeline that powers training and serving for these models, from raw events to online predictions.

Scale

Signal	Value
DAU	45M
Peak QPS (ranking requests)	180K
Active product catalog	120M SKUs
New/updated products per day	8M
User events per day	9B impressions/clicks/cart/purchase events
End-to-end serving latency budget (p99)	150ms
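
As a back-of-envelope aid, the daily figures above can be converted to average per-second rates. This is a minimal sketch: it assumes load is spread evenly over 24 hours, which real traffic is not, so the results are averages and not peak values. The function name `scale_estimates` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def scale_estimates() -> dict:
    """Convert the daily figures from the Scale table into average rates."""
    events_per_day = 9_000_000_000        # user events per day
    catalog_updates_per_day = 8_000_000   # new/updated products per day
    dau = 45_000_000                      # daily active users

    return {
        # Average ingest rate the event pipeline must sustain.
        "avg_events_per_sec": events_per_day / SECONDS_PER_DAY,
        # Average rate of product metadata changes.
        "avg_catalog_updates_per_sec": catalog_updates_per_day / SECONDS_PER_DAY,
        # Average number of events generated per active user per day.
        "events_per_user_per_day": events_per_day / dau,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Under the even-spread assumption this gives roughly 104K events per second, about 93 catalog updates per second, and 200 events per active user per day. Peak ingest will be higher; the prompt gives a peak figure only for ranking requests (180K QPS).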

Deliverables

Define the data sources, schemas, and storage layers needed for both offline training and online inference.
Design the full pipeline from event ingestion and feature computation to model training, validation, deployment, and feedback logging.
Propose a serving architecture that supports multi-stage inference (candidate retrieval, ranking, optional re-ranking) under the latency budget.
Explain how you would prevent training-serving skew, handle feature drift, and keep feature freshness within SLA.
Define offline and online evaluation, monitoring, rollback, and backfill strategies.
Identify major failure modes in the pipeline and how you would detect and mitigate them.

Constraints

User behavioral features must be available online within 2 minutes of an event.
Product metadata updates can arrive out of order from multiple seller systems.
Privacy policy requires deletion of user-level data within 30 days of a deletion request.
The system must continue serving degraded but safe recommendations if the online feature store or model service is partially unavailable.
Infra cost matters: GPU usage is allowed for training, but online ranking should primarily run on CPU unless clearly justified.
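
To make the out-of-order constraint concrete, one common way to handle it is a last-write-wins upsert keyed on a source-side timestamp or version. The sketch below is illustrative only and is not part of the prompt: `ProductUpdate`, `ProductStore`, and the `source_ts_ms` field are hypothetical names, and it assumes each seller system attaches a comparable timestamp to every update.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProductUpdate:
    sku: str
    source_ts_ms: int  # assumed: timestamp set by the seller system, not arrival time
    price_cents: int


class ProductStore:
    """In-memory stand-in for a product metadata store (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, ProductUpdate] = {}

    def apply(self, update: ProductUpdate) -> bool:
        """Apply an update only if it is newer than what is stored.

        Returns True if the update was applied, False if it was stale.
        """
        current = self._latest.get(update.sku)
        if current is not None and update.source_ts_ms <= current.source_ts_ms:
            return False  # late or duplicate arrival: keep the newer state
        self._latest[update.sku] = update
        return True

    def get(self, sku: str) -> Optional[ProductUpdate]:
        return self._latest.get(sku)


if __name__ == "__main__":
    store = ProductStore()
    # The newer update arrives first; the older one arrives late.
    print(store.apply(ProductUpdate("sku-1", source_ts_ms=2000, price_cents=1299)))
    print(store.apply(ProductUpdate("sku-1", source_ts_ms=1000, price_cents=1499)))
    print(store.get("sku-1"))
```

Because applying a stale or duplicate update is a no-op, replays and backfills are safe under this scheme. It does depend on timestamps being comparable across seller systems; if clocks cannot be trusted, a per-SKU version number issued by a single authority is a possible alternative.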
