Product Context
Agero uses ML to predict roadside assistance dispatch ETAs and job outcomes inside its dispatch and service-provider operations workflows. These predictions are consumed by internal operators and downstream automation, so silent model degradation can lengthen member wait times, cause poor provider assignments, and generate unnecessary support contacts.
Scale
| Signal | Value |
|---|---|
| Roadside events/day | 1.2M |
| Peak scoring QPS | 2,500 |
| Active service providers | 80K |
| Geographic cells served | 25K |
| Features per request | ~250 |
| End-to-end prediction latency budget (p99) | 120 ms |
| Training history retained | 24 months |
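As a rough sizing aid, the figures in the table can be turned into per-second rates. The sketch below only does arithmetic on the stated values; it treats "~250" features as exactly 250, and it makes no assumption about how scoring requests map to roadside events, since the table does not say.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the Scale table.
events_per_day = 1_200_000
peak_scoring_qps = 2_500
features_per_request = 250  # table says "~250"; treated as exact for this estimate

# Average event arrival rate across a full day.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Feature values read (and potentially logged) per second at peak scoring load.
peak_feature_values_per_sec = peak_scoring_qps * features_per_request

# How much larger peak scoring traffic is than the average event rate.
peak_qps_to_avg_event_rate = peak_scoring_qps / avg_events_per_sec

print(f"average events/s:            {avg_events_per_sec:.1f}")
print(f"peak feature values/s:       {peak_feature_values_per_sec:,}")
print(f"peak QPS / avg event rate:   {peak_qps_to_avg_event_rate:.0f}x")
```

The average event rate works out to roughly 13.9 events per second, while peak scoring reads 625,000 feature values per second. That gap suggests that logging and monitoring capacity should be sized from scoring traffic rather than from the daily event count.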
Task
Design the end-to-end production monitoring system for Agero's ETA and dispatch-ranking models. Your design should cover both the prediction service and the data/ML pipelines behind it.
- Define the functional and non-functional requirements for monitoring, alerting, and safe rollback.
- Propose the serving architecture for online scoring, feature retrieval, logging, and fallback behavior when the model or feature pipeline is unhealthy.
- Specify what metrics you would track to detect data quality issues, feature drift, concept drift, training-serving skew, calibration problems, and business impact regressions.
- Describe the offline and online evaluation strategy, including delayed-label handling for ETA accuracy and provider acceptance outcomes.
- Explain how monitoring feeds back into retraining, incident response, and release management.
- Identify the top failure modes at Agero scale and how you would detect and mitigate them.
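For the feature-drift item in the list above, one commonly used metric is the Population Stability Index (PSI). The sketch below is a minimal standard-library version, not a prescribed solution: the function name `psi`, the choice of 10 quantile bins, and the `eps` floor for empty bins are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    `baseline` is the reference sample (for example, training data) and
    `current` is a recent serving window. Both are plain sequences of numbers.
    """
    ordered = sorted(baseline)
    # Interior bin edges at baseline quantiles, so each baseline bin
    # holds roughly the same share of the reference data.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so the log ratio below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Usage: identical samples give zero; a shifted sample gives a large value.
reference = list(range(1000))
shifted = [v + 500 for v in reference]
print(psi(reference, reference))
print(psi(reference, shifted))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than requirements from this prompt. In a design answer, the same function would be run per feature and per segment (geography, provider cohort, line of business, time of day).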
Constraints
- ETA labels are delayed and noisy because actual arrival time depends on provider acceptance, traffic, cancellations, and manual overrides.
- Some features are near-real-time (traffic, provider location, queue depth), while others are batch-updated daily.
- The system must remain available during feature store outages and partial regional failures.
- Monitoring must support segmented analysis by geography, provider cohort, line of business, and time of day.
- Compliance requirement: logs used for monitoring and training must avoid storing unnecessary member PII in raw form.
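One way to approach the PII constraint above is to sanitize each prediction log record before it is written: drop fields that monitoring does not need and replace identifiers with keyed hashes so records can still be joined to delayed labels. The sketch below is illustrative only. The field names (`member_id`, `member_name`, `phone`, `street_address`) and the function `sanitize_log_record` are hypothetical, because the real log schema is not given here, and key management is out of scope.

```python
import hashlib
import hmac

# Hypothetical field names; the actual log schema is not specified.
DROP_FIELDS = {"member_name", "phone", "street_address"}
PSEUDONYMIZE_FIELDS = {"member_id"}


def sanitize_log_record(record: dict, key: bytes) -> dict:
    """Return a copy of `record` that is safe to store for monitoring.

    Fields in DROP_FIELDS are removed. Fields in PSEUDONYMIZE_FIELDS are
    replaced by an HMAC-SHA256 token, which is stable for a given key so
    the same member can still be joined across logs and label tables.
    """
    clean = {}
    for name, value in record.items():
        if name in DROP_FIELDS:
            continue
        if name in PSEUDONYMIZE_FIELDS:
            token = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            clean[name + "_token"] = token.hexdigest()
        else:
            clean[name] = value
    return clean


# Usage with a made-up record and a placeholder key.
raw = {"member_id": "M123", "phone": "555-0100", "predicted_eta_min": 27.5}
safe = sanitize_log_record(raw, key=b"placeholder-key")
print(sorted(safe))
```

A keyed hash is used instead of a plain hash so that tokens cannot be reproduced by anyone who lacks the key. Rotating the key breaks joins with older logs, which is a trade-off the design should state explicitly.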