Product Context
Agero wants a real-time ML decision service for its roadside assistance dispatch workflow. When a driver requests help, the system should score and rank the eligible service providers for that event so that Agero can return a low-latency assignment recommendation to dispatch operators.
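Purely for concreteness, a minimal sketch of what the service's request/response contract might look like; every field name here is an illustrative assumption rather than a given schema.

```python
from dataclasses import dataclass

# Illustrative request/response shapes for the decision service.
# All field names are hypothetical, not a prescribed schema.

@dataclass
class DispatchRequest:
    event_id: str
    vehicle_lat: float
    vehicle_lon: float
    service_type: str          # e.g. "tow", "jump_start", "lockout"
    requested_at_epoch_ms: int

@dataclass
class ProviderRecommendation:
    provider_id: str
    score: float               # model score used for ranking
    predicted_eta_min: float
    explanation: str           # human-readable reason for operations teams

@dataclass
class DispatchResponse:
    event_id: str
    ranked_providers: list[ProviderRecommendation]  # best candidate first
    degraded: bool             # True if a heuristic fallback was used
```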
Scale
| Signal | Value |
|---|---|
| Monthly roadside events | 9M |
| Peak decision QPS | 1,200 requests/sec during weather spikes |
| Active service provider network | 80K providers / towing assets |
| Candidate providers per event | 50-400 after geo and policy filters |
| Historical training data | ~250M dispatch / ETA / outcome records |
| End-to-end latency budget (p99) | 150ms |
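A rough implication of these numbers: 1,200 events/sec at peak times 50-400 candidates per event means the ranking stage must score on the order of 60K-480K provider candidates per second, and every per-candidate feature lookup and model call has to fit inside the 150ms p99 budget.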
Task
Design an end-to-end ML system on AWS using Python and SageMaker for this dispatch decision service. Your design should address:
- How you would define the prediction target and success metrics for provider selection in Agero's dispatch flow
- The full architecture for offline training and online inference, including feature computation, model deployment, and request routing
- A multi-stage decision pipeline, including candidate retrieval/filtering, ranking, and any final policy or re-ranking layer (see the pipeline sketch after this list)
- How you would support both real-time features (vehicle location, provider availability, weather, traffic) and batch features (historical acceptance rate, completion rate, average ETA error); a feature-assembly sketch follows the list
- How you would evaluate the system offline and online, and how you would safely roll out model changes; a canary-routing sketch follows the list
- The main failure modes, including feature drift, training-serving skew, stale provider state, and degraded downstream dependencies
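To anchor the multi-stage bullet above, a minimal, runnable sketch of how retrieval, ranking, and policy re-ranking could compose; every class, function, and threshold is a hypothetical placeholder, and the linear score stands in for a trained ranking model.

```python
from dataclasses import dataclass

# Minimal sketch of a multi-stage decision pipeline: candidate retrieval,
# model ranking, then a final policy re-rank. Every function, feature, and
# threshold here is a hypothetical placeholder, not a prescribed design.

@dataclass
class Provider:
    provider_id: str
    distance_km: float        # real-time feature (geo)
    available: bool           # real-time feature, must be < 1-2 min stale
    acceptance_rate: float    # batch feature from historical records
    contract_priority: int    # policy input, higher means preferred


def retrieve_candidates(providers: list[Provider]) -> list[Provider]:
    """Stage 1: cheap geo/policy filters that cut the 80K network down
    to the 50-400 candidates the Scale table describes."""
    return [p for p in providers if p.available and p.distance_km < 40.0]


def score(p: Provider) -> float:
    """Stage 2 stand-in: a linear score where a trained ranking model
    (e.g. a SageMaker-hosted gradient-boosted tree) would actually sit."""
    return 0.7 * p.acceptance_rate - 0.02 * p.distance_km


def policy_rerank(
    ranked: list[tuple[Provider, float]],
) -> list[tuple[Provider, float]]:
    """Stage 3: business rules the model should not learn implicitly,
    e.g. break score ties in favor of contractual priority."""
    return sorted(ranked, key=lambda ps: (ps[1], ps[0].contract_priority), reverse=True)


def recommend(providers: list[Provider], k: int = 5) -> list[tuple[Provider, float]]:
    candidates = retrieve_candidates(providers)
    ranked = [(p, score(p)) for p in candidates]
    return policy_rerank(ranked)[:k]
```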
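The real-time/batch split could be served by joining a low-latency online lookup with precomputed offline aggregates at request time. A sketch under that assumption; the store shapes and feature keys are illustrative, and in practice the online store might be SageMaker Feature Store, DynamoDB, or ElastiCache.

```python
from dataclasses import dataclass

# Illustrative merge of real-time and batch features for one candidate
# provider. Store layouts and key names are hypothetical.

@dataclass
class FeatureVector:
    # real-time (must be fresh within 1-2 minutes)
    provider_available: bool
    provider_eta_min: float
    # batch (recomputed offline from historical records)
    acceptance_rate: float
    completion_rate: float
    avg_eta_error_min: float


def assemble_features(provider_id: str, online_store: dict, batch_store: dict) -> FeatureVector:
    """Join low-latency online features with precomputed batch features.
    Both stores are keyed by provider_id in this sketch."""
    rt = online_store[provider_id]    # small, hot key-value lookup
    hist = batch_store[provider_id]   # precomputed offline, refreshed daily
    return FeatureVector(
        provider_available=rt["available"],
        provider_eta_min=rt["eta_min"],
        acceptance_rate=hist["acceptance_rate"],
        completion_rate=hist["completion_rate"],
        avg_eta_error_min=hist["avg_eta_error_min"],
    )
```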
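For the rollout bullet, one common pattern is a deterministic hash-based canary split, so a small fraction of events hits the challenger model while assignment stays stable per event; the fraction and variant names below are assumptions.

```python
import hashlib

# Sketch of a deterministic canary split for safe model rollout. The 5%
# default and the "champion"/"challenger" names are assumptions.

def model_variant(event_id: str, canary_fraction: float = 0.05) -> str:
    """Route ~canary_fraction of events to the challenger and the rest to
    the champion. Hashing the event id keeps routing stable per event."""
    digest = hashlib.sha256(event_id.encode()).digest()
    bucket = digest[0] / 255.0  # roughly uniform in [0, 1]
    return "challenger" if bucket < canary_fraction else "champion"
```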
Constraints
- The service must return a recommendation within 150ms p99 and remain available during regional surge events
- Provider availability and ETA-related features must be fresh within 1-2 minutes (see the staleness-guard sketch after this list)
- Some labels are delayed: completion outcomes and customer satisfaction scores may arrive hours later
- The system must be explainable enough for operations teams to understand why a provider was recommended
- Cost matters: avoid an architecture that requires GPUs for every online request unless clearly justified
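To make the freshness and degraded-mode constraints concrete, a small sketch of a staleness guard with a heuristic fallback; the 120-second threshold and the distance-based fallback are illustrative assumptions.

```python
import time

# Illustrative guard for the 1-2 minute feature-freshness constraint and
# the degraded-dependency failure mode. The threshold and the fallback
# heuristic are assumptions for the sketch, not prescribed values.

MAX_AVAILABILITY_AGE_SEC = 120  # provider availability must be < 2 min old


def availability_is_fresh(feature_timestamp_sec: float, now_sec: float | None = None) -> bool:
    """True if the real-time availability feature is within the freshness budget."""
    now_sec = time.time() if now_sec is None else now_sec
    return (now_sec - feature_timestamp_sec) <= MAX_AVAILABILITY_AGE_SEC


def score_with_fallback(model_score: float | None, distance_km: float, fresh: bool) -> tuple[float, bool]:
    """Return (score, degraded). If the model or its real-time features are
    unavailable or stale, fall back to a distance heuristic so the service
    still answers inside the latency budget."""
    if model_score is not None and fresh:
        return model_score, False
    return -distance_km, True  # nearer providers rank higher in degraded mode
```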