Product Context
RouteSphere is a global logistics platform used by dispatchers, carriers, and warehouse operators to move parcels across air, sea, and ground networks. Design an ML-driven disaster recovery system that recommends fallback routes and recovery actions when disruptions such as port closures, storms, labor strikes, customs delays, or hub outages occur.
Scale
| Signal | Value |
|---|---|
| Daily active operators | 450K |
| Shipments tracked per day | 28M |
| Peak disruption-related decision QPS | 18K |
| Global facilities / hubs / ports / lanes | 120K nodes, 2.5M edges |
| Active shipment graph state updates | ~75K events/sec |
| Per-decision latency budget (p99) | 350ms |
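
A few back-of-envelope figures follow directly from the table and can help size the design. The sketch below is only arithmetic on the stated values. It assumes the ~75K events/sec rate is sustained all day (it may in fact be a peak), and it treats the p99 budget as a worst-case service time when estimating concurrency. The function and key names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Values copied from the Scale table.
SHIPMENTS_PER_DAY = 28_000_000
EVENTS_PER_SEC = 75_000        # assumption: sustained rate, not only a peak
PEAK_DECISION_QPS = 18_000
P99_LATENCY_S = 0.350
NODES = 120_000
EDGES = 2_500_000


def capacity_estimates() -> dict:
    """Derive rough sizing numbers from the stated scale signals."""
    events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY
    return {
        # Average rate at which new shipments enter tracking.
        "avg_shipments_per_sec": SHIPMENTS_PER_DAY / SECONDS_PER_DAY,
        # Total state updates per day if the event rate is sustained.
        "events_per_day": events_per_day,
        # Average state updates per tracked shipment per day.
        "events_per_shipment": events_per_day / SHIPMENTS_PER_DAY,
        # Little's law bound: in-flight decisions if every request
        # took the full p99 budget at peak QPS.
        "max_in_flight_decisions": PEAK_DECISION_QPS * P99_LATENCY_S,
        # Average number of edges per node in the network graph.
        "avg_edges_per_node": EDGES / NODES,
    }


if __name__ == "__main__":
    for name, value in capacity_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the network graph is sparse (about 21 edges per node), and the event stream, not the decision QPS, dominates write volume.
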
Task
- Clarify the product goal: which decisions the system should automate and which it should only recommend to human operators during disruptions.
- Design an end-to-end ML system for disruption detection, candidate recovery-plan generation, ranking, and final re-ranking under operational constraints.
- Specify the offline and online architecture, including feature pipelines, feature store design, model training cadence, and serving path.
- Choose models for each stage and explain tradeoffs across accuracy, latency, interpretability, and regional generalization.
- Define offline and online evaluation, including business guardrails such as SLA adherence, cost-to-serve, and fairness across regions/customers.
- Identify major failure modes, especially feature drift, training-serving skew, stale network state, and regional outages affecting the ML stack itself.
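
For the feature-drift failure mode, one common monitoring signal is the Population Stability Index (PSI) between a feature's training distribution and its serving distribution. The sketch below is one possible check, not a prescribed part of the design: the equal-width binning, the `eps` floor for empty bins, and the function name are assumptions made for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i).

    `expected` is the reference (training) sample, `actual` the serving
    sample. Bin edges are equal-width over the range of `expected`.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range serving values into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    train = [float(v) for v in range(100)]
    shifted = [v + 50.0 for v in train]
    print(population_stability_index(train, train))
    print(population_stability_index(train, shifted))
```

A frequently cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but any alert threshold should be tuned per feature and per region.
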
Constraints
- The system must continue operating during partial regional cloud outages and degraded upstream data feeds.
- Some labels are delayed by hours or days because final delivery outcomes arrive late.
- Certain recovery actions require human approval for regulated lanes or hazardous goods.
- Cost matters: only the highest-value ranking stage may use GPUs; retrieval and most scoring should run on CPUs.
- Recommendations must be auditable: operators need reason codes for why a reroute or hold decision was suggested.
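
The approval and auditability constraints can be made concrete with a small gating step that runs after ranking. The following is a minimal sketch under stated assumptions: the `Recommendation` fields, the three dispositions, the reason-code strings, and the `auto_threshold` value are all hypothetical and not part of the problem statement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Disposition(Enum):
    AUTO_EXECUTE = "auto_execute"          # system acts without an operator
    RECOMMEND = "recommend"                # shown to an operator as a suggestion
    REQUIRE_APPROVAL = "require_approval"  # blocked until a human approves


@dataclass
class Recommendation:
    shipment_id: str
    action: str                # e.g. "reroute" or "hold"
    score: float               # ranker confidence in [0, 1] (assumed scale)
    reason_codes: List[str]    # e.g. ["PORT_CLOSED"] (hypothetical codes)
    regulated_lane: bool = False
    hazardous_goods: bool = False


def gate(rec: Recommendation, auto_threshold: float = 0.9) -> Disposition:
    """Decide how a ranked recommendation may be acted on."""
    # Auditability: refuse to surface anything without a reason code.
    if not rec.reason_codes:
        raise ValueError(f"{rec.shipment_id}: recommendation lacks reason codes")
    # Regulated lanes and hazardous goods always need human approval,
    # regardless of model confidence.
    if rec.regulated_lane or rec.hazardous_goods:
        return Disposition.REQUIRE_APPROVAL
    # Hypothetical policy: only high-confidence actions are automated.
    if rec.score >= auto_threshold:
        return Disposition.AUTO_EXECUTE
    return Disposition.RECOMMEND


if __name__ == "__main__":
    rec = Recommendation("S-1", "reroute", 0.95, ["PORT_CLOSED"], hazardous_goods=True)
    print(gate(rec).value)
```

Keeping the gate as a deterministic rule outside the model means the approval requirement holds even if the ranker is degraded or replaced by a fallback.
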