Product Context
PulseChat is a global messaging platform used by consumers and small businesses. The infrastructure team wants an ML-assisted delivery system that classifies each incoming delivery attempt as a true first-send, a safe retry, or a likely duplicate. The goal is to preserve idempotent delivery across retries, client reconnects, and downstream failures.
Scale
| Signal | Value |
|---|---|
| DAU | 120M |
| Messages sent/day | 9B |
| Peak delivery-attempt QPS | 220K |
| Peak retry QPS during incidents | 500K |
| Active conversation graph | 2.5B user pairs / groups |
| Dedup / decision latency budget (p99) | 35 ms |
| Retention window for idempotency keys | 7 days |
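A quick back-of-envelope pass over the table helps size the idempotency-key store before any modeling starts. The sketch below uses only the figures above, plus two labeled assumptions that are not in the brief: one idempotency key per message sent, and 32 bytes of raw key material per entry.

```python
# Figures taken directly from the Scale table.
MESSAGES_PER_DAY = 9_000_000_000
PEAK_ATTEMPT_QPS = 220_000
RETENTION_DAYS = 7
SECONDS_PER_DAY = 86_400

# ASSUMPTION (not in the brief): raw size of one stored key, before
# metadata, indexes, or replication overhead.
ASSUMED_BYTES_PER_KEY = 32

# Average send rate implied by the daily volume.
avg_message_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Loose peak-to-average ratio. Delivery attempts and messages sent are
# different units (fan-out, retries), so treat this as indicative only.
peak_to_avg = PEAK_ATTEMPT_QPS / avg_message_qps

# ASSUMPTION: one idempotency key per message sent.
retained_keys = MESSAGES_PER_DAY * RETENTION_DAYS
raw_key_bytes = retained_keys * ASSUMED_BYTES_PER_KEY

print(f"average send rate: {avg_message_qps:,.0f} msg/s")
print(f"peak / average:    {peak_to_avg:.2f}x")
print(f"keys retained:     {retained_keys:,}")
print(f"raw key storage:   {raw_key_bytes / 1e12:.2f} TB")
```

Under these assumptions the store holds about 63 billion keys over the 7-day window, so the deterministic lookup path has to be a sharded key-value tier rather than a single-node cache. Per-recipient keys for group fan-out would raise this figure.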
Task
Design an end-to-end ML system that helps enforce idempotent message delivery at scale. Your design should address:
- How you would frame the problem, define the prediction target, and separate deterministic idempotency guarantees from ML-based decisioning
- The full architecture: online serving path, offline training path, feature store, feedback logging, and how retries flow through the system
- A multi-stage decision pipeline (for example: fast retrieval of prior attempts/events → ranking/scoring duplicate likelihood → policy or re-ranking layer for final action)
- Model choices for each stage, including what features are available at request time and how you avoid training-serving skew
- Offline and online evaluation, including business metrics, safety guardrails, and how you would run a staged rollout
- Failure modes such as feature drift, stale state, partial outages, replay storms, and incorrect suppression of legitimate messages
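As one possible starting point for the multi-stage pipeline named above, the following minimal sketch runs a deterministic key lookup first and consults a scorer only when no key matches. Every name here (`Attempt`, `PriorRecord`, `decide`, `heuristic_score`, the 0.99 threshold, the 2-second window) is hypothetical and chosen for illustration; the in-memory dicts stand in for real storage, and the heuristic stands in for a trained model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Action(Enum):
    DELIVER = "deliver"        # treat as a true first-send
    SAFE_RETRY = "safe_retry"  # known key, delivery not yet confirmed
    SUPPRESS = "suppress"      # duplicate, do not deliver again


@dataclass(frozen=True)
class Attempt:
    idempotency_key: Optional[str]
    sender_id: str
    conversation_id: str
    payload_hash: str
    client_ts_ms: int


@dataclass(frozen=True)
class PriorRecord:
    attempt: Attempt
    delivered: bool


def heuristic_score(new: Attempt, old: Attempt) -> float:
    """Stand-in for a learned duplicate-likelihood model (hypothetical)."""
    same_content = (new.sender_id == old.sender_id
                    and new.payload_hash == old.payload_hash)
    close_in_time = abs(new.client_ts_ms - old.client_ts_ms) <= 2_000
    return 1.0 if same_content and close_in_time else 0.0


def decide(
    attempt: Attempt,
    key_store: Dict[str, PriorRecord],
    recent_by_conversation: Dict[str, List[PriorRecord]],
    score_fn: Callable[[Attempt, Attempt], float] = heuristic_score,
    suppress_threshold: float = 0.99,
) -> Action:
    # Stage 0: deterministic guarantee. An exact key match never reaches ML.
    if attempt.idempotency_key is not None:
        prior = key_store.get(attempt.idempotency_key)
        if prior is not None:
            return Action.SUPPRESS if prior.delivered else Action.SAFE_RETRY

    # Stage 1: retrieve recent attempts in the same conversation.
    candidates = recent_by_conversation.get(attempt.conversation_id, [])

    # Stage 2: score duplicate likelihood against each candidate.
    best = max((score_fn(attempt, c.attempt) for c in candidates), default=0.0)

    # Stage 3: policy. Default to delivery unless the score is very high.
    return Action.SUPPRESS if best >= suppress_threshold else Action.DELIVER
```

The stand-in heuristic would wrongly suppress a user who deliberately sends the same text twice within two seconds, which is exactly the false-suppression risk the design has to address. A real scorer would need more signals, such as client session and reconnect state, than this sketch uses.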
Constraints
- The system must never rely on ML alone for correctness; deterministic keys and storage semantics are required for hard guarantees where possible
- User-visible duplicate deliveries are very costly, but false suppression of legitimate messages is worse for trust and compliance
- Some delivery metadata arrives late or out of order from clients and regional brokers
- Data residency rules require EU user event logs to stay in-region
- Cost target: average online decisioning cost below $0.00015 per delivery attempt
- During regional outages, retry traffic can spike 2-3x and feature freshness may degrade
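The asymmetry between duplicates and false suppression can be turned into a decision threshold with a simple expected-cost argument: suppress only when the expected cost of delivering a duplicate exceeds the expected cost of dropping a legitimate message. The sketch below is one way to express that, not a prescribed policy. The function names and the 20:1 cost ratio in the usage lines are assumptions for illustration, and the fail-open rule for stale features is one possible reading of the outage constraint.

```python
def suppress_threshold(cost_false_suppress: float, cost_duplicate: float) -> float:
    """Duplicate probability above which suppressing has lower expected cost.

    Suppress when p * cost_duplicate > (1 - p) * cost_false_suppress,
    which rearranges to p > cost_false_suppress / (cost_false_suppress + cost_duplicate).
    """
    if cost_false_suppress <= 0 or cost_duplicate <= 0:
        raise ValueError("costs must be positive")
    return cost_false_suppress / (cost_false_suppress + cost_duplicate)


def should_suppress(
    p_duplicate: float,
    cost_false_suppress: float,
    cost_duplicate: float,
    features_fresh: bool,
) -> bool:
    # Fail open: if features are stale (for example during a regional
    # outage), do not let the model suppress anything. Deterministic key
    # checks, which are outside this function, still apply.
    if not features_fresh:
        return False
    return p_duplicate > suppress_threshold(cost_false_suppress, cost_duplicate)


# ASSUMPTION: a false suppression is treated as 20x as costly as a duplicate.
threshold = suppress_threshold(cost_false_suppress=20.0, cost_duplicate=1.0)
print(f"suppress only when p(duplicate) > {threshold:.3f}")
```

With the assumed 20:1 ratio the model must be more than about 95% confident before it may suppress, and the threshold moves toward 1 as false suppression is weighted more heavily. This presumes the model's probabilities are calibrated; otherwise the threshold would need to be tuned on held-out data instead.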