Product Context
Agero wants a real-time ML decision service for its roadside assistance dispatch workflow. When a driver requests help, the system should score and rank the eligible service providers for that event so that Agero can return a low-latency assignment recommendation to dispatch operators.
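Purely for concreteness, a minimal sketch of what the service's request/response contract might look like; every field name here is an illustrative assumption rather than a given schema.

```python
from dataclasses import dataclass

# Illustrative request/response shapes for the decision service.
# All field names are hypothetical, not a prescribed schema.

@dataclass
class DispatchRequest:
    event_id: str
    vehicle_lat: float
    vehicle_lon: float
    service_type: str          # e.g. "tow", "jump_start", "lockout"
    requested_at_epoch_ms: int

@dataclass
class ProviderRecommendation:
    provider_id: str
    score: float               # model score used for ranking
    predicted_eta_min: float
    explanation: str           # human-readable reason for operations teams

@dataclass
class DispatchResponse:
    event_id: str
    ranked_providers: list[ProviderRecommendation]  # best candidate first
    degraded: bool             # True if a heuristic fallback was used
```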
Scale
| Signal | Value |
|---|---|
| Monthly roadside events | 9M |
| Peak decision QPS | 1,200 requests/sec during weather spikes |
| Active service provider network | 80K providers / towing assets |
| Candidate providers per event | 50-400 after geo and policy filters |
| Historical training data | ~250M dispatch / ETA / outcome records |
| End-to-end latency budget (p99) | 150ms |
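A rough implication of these numbers: 1,200 events/sec at peak times 50-400 candidates per event means the ranking stage must score on the order of 60K-480K provider candidates per second, and every per-candidate feature lookup and model call has to fit inside the 150ms p99 budget.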
Task
Design an end-to-end ML system on AWS using Python and SageMaker for this dispatch decision service. Your design should address:
- How you would define the prediction target and success metrics for provider selection in Agero's dispatch flow
- The full architecture for offline training and online inference, including feature computation, model deployment, and request routing
- A multi-stage decision pipeline, including candidate retrieval/filtering, ranking, and any final policy or re-ranking layer (see the pipeline sketch after this list)
- How you would support both real-time features (vehicle location, provider availability, weather, traffic) and batch features (historical acceptance rate, completion rate, average ETA error); a feature-assembly sketch follows the list
- How you would evaluate the system offline and online, and how you would safely roll out model changes; a canary-routing sketch follows the list
- The main failure modes, including feature drift, training-serving skew, stale provider state, and degraded downstream dependencies
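To anchor the multi-stage bullet above, a minimal, runnable sketch of how retrieval, ranking, and policy re-ranking could compose; every class, function, and threshold is a hypothetical placeholder, and the linear score stands in for a trained ranking model.

```python
from dataclasses import dataclass

# Minimal sketch of a multi-stage decision pipeline: candidate retrieval,
# model ranking, then a final policy re-rank. Every function, feature, and
# threshold here is a hypothetical placeholder, not a prescribed design.

@dataclass
class Provider:
    provider_id: str
    distance_km: float        # real-time feature (geo)
    available: bool           # real-time feature, must be < 1-2 min stale
    acceptance_rate: float    # batch feature from historical records
    contract_priority: int    # policy input, higher means preferred


def retrieve_candidates(providers: list[Provider]) -> list[Provider]:
    """Stage 1: cheap geo/policy filters that cut the 80K network down
    to the 50-400 candidates the Scale table describes."""
    return [p for p in providers if p.available and p.distance_km < 40.0]


def score(p: Provider) -> float:
    """Stage 2 stand-in: a linear score where a trained ranking model
    (e.g. a SageMaker-hosted gradient-boosted tree) would actually sit."""
    return 0.7 * p.acceptance_rate - 0.02 * p.distance_km


def policy_rerank(
    ranked: list[tuple[Provider, float]],
) -> list[tuple[Provider, float]]:
    """Stage 3: business rules the model should not learn implicitly,
    e.g. break score ties in favor of contractual priority."""
    return sorted(ranked, key=lambda ps: (ps[1], ps[0].contract_priority), reverse=True)


def recommend(providers: list[Provider], k: int = 5) -> list[tuple[Provider, float]]:
    candidates = retrieve_candidates(providers)
    ranked = [(p, score(p)) for p in candidates]
    return policy_rerank(ranked)[:k]
```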
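The real-time/batch split could be served by joining a low-latency online lookup with precomputed offline aggregates at request time. A sketch under that assumption; the store shapes and feature keys are illustrative, and in practice the online store might be SageMaker Feature Store, DynamoDB, or ElastiCache.

```python
from dataclasses import dataclass

# Illustrative merge of real-time and batch features for one candidate
# provider. Store layouts and key names are hypothetical.

@dataclass
class FeatureVector:
    # real-time (must be fresh within 1-2 minutes)
    provider_available: bool
    provider_eta_min: float
    # batch (recomputed offline from historical records)
    acceptance_rate: float
    completion_rate: float
    avg_eta_error_min: float


def assemble_features(provider_id: str, online_store: dict, batch_store: dict) -> FeatureVector:
    """Join low-latency online features with precomputed batch features.
    Both stores are keyed by provider_id in this sketch."""
    rt = online_store[provider_id]    # small, hot key-value lookup
    hist = batch_store[provider_id]   # precomputed offline, refreshed daily
    return FeatureVector(
        provider_available=rt["available"],
        provider_eta_min=rt["eta_min"],
        acceptance_rate=hist["acceptance_rate"],
        completion_rate=hist["completion_rate"],
        avg_eta_error_min=hist["avg_eta_error_min"],
    )
```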
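For the rollout bullet, one common pattern is a deterministic hash-based canary split, so a small fraction of events hits the challenger model while assignment stays stable per event; the fraction and variant names below are assumptions.

```python
import hashlib

# Sketch of a deterministic canary split for safe model rollout. The 5%
# default and the "champion"/"challenger" names are assumptions.

def model_variant(event_id: str, canary_fraction: float = 0.05) -> str:
    """Route ~canary_fraction of events to the challenger and the rest to
    the champion. Hashing the event id keeps routing stable per event."""
    digest = hashlib.sha256(event_id.encode()).digest()
    bucket = digest[0] / 255.0  # roughly uniform in [0, 1]
    return "challenger" if bucket < canary_fraction else "champion"
```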
Constraints
- The service must return a recommendation within 150ms p99 and remain available during regional surge events
- Provider availability and ETA-related features must be fresh within 1-2 minutes (see the staleness-guard sketch after this list)
- Some labels are delayed: completion outcomes and customer satisfaction scores may arrive hours later
- The system must be explainable enough for operations teams to understand why a provider was recommended
- Cost matters: avoid an architecture that requires GPUs for every online request unless clearly justified
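To make the freshness and degraded-mode constraints concrete, a small sketch of a staleness guard with a heuristic fallback; the 120-second threshold and the distance-based fallback are illustrative assumptions.

```python
import time

# Illustrative guard for the 1-2 minute feature-freshness constraint and
# the degraded-dependency failure mode. The threshold and the fallback
# heuristic are assumptions for the sketch, not prescribed values.

MAX_AVAILABILITY_AGE_SEC = 120  # provider availability must be < 2 min old


def availability_is_fresh(feature_timestamp_sec: float, now_sec: float | None = None) -> bool:
    """True if the real-time availability feature is within the freshness budget."""
    now_sec = time.time() if now_sec is None else now_sec
    return (now_sec - feature_timestamp_sec) <= MAX_AVAILABILITY_AGE_SEC


def score_with_fallback(model_score: float | None, distance_km: float, fresh: bool) -> tuple[float, bool]:
    """Return (score, degraded). If the model or its real-time features are
    unavailable or stale, fall back to a distance heuristic so the service
    still answers inside the latency budget."""
    if model_score is not None and fresh:
        return model_score, False
    return -distance_km, True  # nearer providers rank higher in degraded mode
```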