You are designing the ML layer for a real-time chat application with one-to-one and group messaging. The core transport system already handles basic send, ack, and retry semantics, but delivery quality varies with device state, network conditions, and recipient availability. Your team wants an ML system that predicts delivery risk in real time so the app can choose a retry strategy, push escalation, and queue prioritization that improve the delivered-within-seconds rate without breaking message ordering. This directly affects user trust and engagement on the messaging surface.

| Signal | Value |
|---|---|
| DAU | 900M |
| Peak message send QPS | 3.5M |
| Peak delivery-attempt QPS | 12M |
| Group-chat share of messages | 18% |
| Max common group size | 512 members |
| p99 decision latency budget | 25 ms |
| Feature freshness target | < 60 s |
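
As a starting point for sizing, the figures in the table can be turned into a few back-of-envelope numbers. The sketch below is only illustrative: it assumes every delivery attempt is scored, and `ASSUMED_REPLICA_QPS` and `ASSUMED_HEADROOM` are hypothetical values that do not come from the table.

```python
import math

# Figures copied from the table above.
PEAK_SEND_QPS = 3_500_000
PEAK_ATTEMPT_QPS = 12_000_000
P99_BUDGET_MS = 25

# Hypothetical assumptions (NOT from the table): sustained throughput of one
# scoring replica, and the fraction of that throughput we plan to use at peak.
ASSUMED_REPLICA_QPS = 20_000
ASSUMED_HEADROOM = 0.5


def attempts_per_message(attempt_qps: float, send_qps: float) -> float:
    """Average delivery attempts generated per sent message at peak."""
    return attempt_qps / send_qps


def in_flight_decisions(attempt_qps: float, latency_ms: float) -> float:
    """Little's law: concurrency = arrival rate * time in system.

    Using the p99 budget as the time in system gives an upper bound.
    """
    return attempt_qps * latency_ms / 1000


def replicas_needed(attempt_qps: float, replica_qps: float, headroom: float) -> int:
    """Replicas required if each one runs at `headroom` of its capacity."""
    return math.ceil(attempt_qps / (replica_qps * headroom))


if __name__ == "__main__":
    print(round(attempts_per_message(PEAK_ATTEMPT_QPS, PEAK_SEND_QPS), 2))
    print(in_flight_decisions(PEAK_ATTEMPT_QPS, P99_BUDGET_MS))
    print(replicas_needed(PEAK_ATTEMPT_QPS, ASSUMED_REPLICA_QPS, ASSUMED_HEADROOM))
```

Under these assumptions, each message produces roughly 3.4 delivery attempts at peak, and at most about 300,000 scoring decisions are in flight at once. The replica count depends entirely on the assumed per-replica throughput.
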

How would you design this end-to-end ML system so it can score delivery risk and choose delivery actions in real time at scale, while preserving message-ordering guarantees and handling delayed labels, feature drift, training-serving skew, and online failures?
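
One possible way to frame the "choose actions without breaking ordering" constraint is sketched below. It is not a reference design: the action names, the thresholds, and the `OrderedDispatcher` class are all hypothetical. The idea it illustrates is that risk may reorder work across conversations, while messages within a single conversation are always released in sequence order.

```python
import heapq
from enum import Enum


class Action(Enum):
    STANDARD_RETRY = "standard_retry"
    FAST_RETRY = "fast_retry"
    PUSH_ESCALATION = "push_escalation"


# Hypothetical thresholds; in practice these would be tuned offline.
FAST_RETRY_THRESHOLD = 0.3
PUSH_THRESHOLD = 0.7


def choose_action(risk: float) -> Action:
    """Map a predicted delivery-risk score in [0, 1] to a delivery action."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError(f"risk must be in [0, 1], got {risk}")
    if risk >= PUSH_THRESHOLD:
        return Action.PUSH_ESCALATION
    if risk >= FAST_RETRY_THRESHOLD:
        return Action.FAST_RETRY
    return Action.STANDARD_RETRY


class OrderedDispatcher:
    """Prioritizes conversations by risk but never reorders within one."""

    def __init__(self):
        # conversation_id -> min-heap of (sequence_number, risk)
        self._pending = {}

    def enqueue(self, conversation_id: str, seq: int, risk: float) -> None:
        heapq.heappush(self._pending.setdefault(conversation_id, []), (seq, risk))

    def pop_next(self):
        """Return (conversation_id, seq, action) or None when empty.

        The conversation with the highest pending risk is served first, but
        the message released is always that conversation's lowest sequence
        number. The linear scan is for clarity only, not for production scale.
        """
        if not self._pending:
            return None
        conv = max(
            self._pending,
            key=lambda c: max(risk for _, risk in self._pending[c]),
        )
        seq, risk = heapq.heappop(self._pending[conv])
        if not self._pending[conv]:
            del self._pending[conv]
        return conv, seq, choose_action(risk)
```

For example, if conversation `"a"` holds a low-risk message 1 followed by a high-risk message 2, the dispatcher serves `"a"` before a medium-risk conversation `"b"`, yet still releases message 1 before message 2.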