Product Context
PulseText is a cloud messaging platform used by banks, ride-share apps, and e-commerce companies to send transactional and promotional SMS worldwide. Design an ML-driven system that ingests, processes, prioritizes, and delivers millions of SMS messages per minute while maximizing delivery success and minimizing spam risk, carrier throttling, and user harm.
Scale
| Signal | Value |
|---|---|
| Active business accounts | 1.2M |
| Daily active sending accounts | 180K |
| Peak ingest rate | 4M SMS/minute (~67K/sec) |
| Peak delivery attempts | 120K/sec |
| Destination phone numbers/day | 350M |
| Countries / carrier routes | 190+ countries, 800+ carrier routes |
| Historical message corpus | 40B SMS events |
| p99 decision latency budget | 120ms before enqueue |
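
The figures in the table can be turned into a rough capacity estimate. The sketch below is a back-of-envelope calculation, not a sizing recommendation: `TOP_K` is an assumed candidate-set size that does not appear in the table, and the attempts-per-message ratio compares two peak values that may not occur at the same moment.

```python
# Back-of-envelope numbers derived from the Scale table.
PEAK_INGEST_PER_MIN = 4_000_000           # from the table
PEAK_DELIVERY_ATTEMPTS_PER_SEC = 120_000  # from the table
CARRIER_ROUTES = 800                      # lower bound from the table ("800+")
TOP_K = 5                                 # ASSUMPTION: candidates kept after retrieval

# Convert the per-minute ingest rate to per-second.
ingest_per_sec = PEAK_INGEST_PER_MIN / 60

# Ratio of peak delivery attempts to peak ingest (retries, failover, etc.).
attempts_per_message = PEAK_DELIVERY_ATTEMPTS_PER_SEC / ingest_per_sec

# Route evaluations per second if every route were scored for every message,
# versus scoring only an assumed TOP_K shortlist.
naive_route_scores_per_sec = ingest_per_sec * CARRIER_ROUTES
pruned_route_scores_per_sec = ingest_per_sec * TOP_K

print(f"ingest/sec:            {ingest_per_sec:,.0f}")
print(f"attempts per message:  {attempts_per_message:.1f}")
print(f"naive route scores/s:  {naive_route_scores_per_sec:,.0f}")
print(f"pruned route scores/s: {pruned_route_scores_per_sec:,.0f}")
```

Scoring all 800 routes for every message would mean tens of millions of route evaluations per second at peak, which is one way to see why a cheap retrieval stage ahead of any heavier ranker is worth considering.
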
Task
- Clarify the product objective and define what the ML system should optimize (delivery rate, latency, fraud/spam prevention) and which decisions it controls (route selection, prioritization under congestion).
- Design an end-to-end multi-stage ML architecture for online message handling, including candidate route retrieval, route ranking, and final policy-based re-ranking or filtering.
- Specify the offline and online data pipelines, labels, feature computation, and how you avoid training-serving skew across message, sender, recipient, and carrier features.
- Propose model choices for each stage and explain tradeoffs between accuracy, latency, interpretability, and operational cost.
- Define evaluation: offline metrics, online experiments, business guardrails, and segment-level analysis for high-risk senders, new routes, and international traffic.
- Identify major failure modes at scale, including feature drift, feedback delay, carrier outages, abuse attacks, and stale routing models.
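
As a starting point for the multi-stage architecture named in the task list, here is a minimal sketch of candidate route retrieval, route ranking, and a final policy filter. Everything in it is hypothetical: the `Route` and `Message` fields, the linear score standing in for a learned ranker, and the `allows_promotional` rule are illustrative assumptions, not part of the prompt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    route_id: str
    country: str
    est_delivery_rate: float  # assumed historical success estimate in [0, 1]
    cost_per_sms: float       # assumed cost in arbitrary currency units
    allows_promotional: bool  # assumed per-route compliance flag


@dataclass(frozen=True)
class Message:
    country: str
    is_promotional: bool


def retrieve_candidates(msg: Message, routes: List[Route], k: int = 3) -> List[Route]:
    """Stage 1: cheap retrieval. Keep the top-k routes serving the destination
    country, ordered by historical delivery rate."""
    in_country = [r for r in routes if r.country == msg.country]
    return sorted(in_country, key=lambda r: r.est_delivery_rate, reverse=True)[:k]


def rank(candidates: List[Route], cost_weight: float = 1.0) -> List[Route]:
    """Stage 2: stand-in for a learned ranker. Scores each candidate as
    delivery rate minus weighted cost; a real model would go here."""
    return sorted(
        candidates,
        key=lambda r: r.est_delivery_rate - cost_weight * r.cost_per_sms,
        reverse=True,
    )


def apply_policy(msg: Message, ranked: List[Route]) -> List[Route]:
    """Stage 3: hard policy filter applied after ranking, so compliance
    rules cannot be overridden by model scores."""
    return [r for r in ranked if r.allows_promotional or not msg.is_promotional]


def choose_route(msg: Message, routes: List[Route]) -> Optional[Route]:
    final = apply_policy(msg, rank(retrieve_candidates(msg, routes)))
    return final[0] if final else None


ROUTES = [
    Route("us-a", "US", 0.97, 0.010, allows_promotional=False),
    Route("us-b", "US", 0.95, 0.004, allows_promotional=True),
    Route("us-c", "US", 0.90, 0.002, allows_promotional=True),
    Route("gb-a", "GB", 0.99, 0.020, allows_promotional=True),
]

transactional = choose_route(Message("US", is_promotional=False), ROUTES)
promotional = choose_route(Message("US", is_promotional=True), ROUTES)
unserved = choose_route(Message("FR", is_promotional=False), ROUTES)
```

Keeping the policy step as a separate, deterministic stage is one possible design choice: it makes blocked outcomes auditable independently of whatever model is used for ranking.
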
Constraints
- Some labels are delayed: final delivery receipts may arrive seconds to hours later, depending on the carrier.
- Compliance constraints vary by country; certain content classes or senders must be blocked or audited.
- The system must degrade safely: if ML is unavailable, transactional traffic must still flow.
- Cost matters: the platform cannot score every route with a heavyweight model for every message.
- New senders and new carrier routes create persistent cold-start problems.
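
The safe-degradation constraint can be illustrated with a small fallback wrapper. This is a sketch under stated assumptions: the static route table, the route names, and the choice to defer promotional traffic while ML is down are all hypothetical. The prompt only requires that transactional traffic keeps flowing. Enforcing the 120ms latency budget as a timeout is not shown here.

```python
from typing import Callable, Optional, Tuple

# ASSUMPTION: a static, pre-approved default route per country, maintained
# outside the ML system so it remains available when models are not.
STATIC_DEFAULT_ROUTE = {"US": "us-default", "GB": "gb-default"}


def route_with_fallback(
    country: str,
    is_transactional: bool,
    ml_scorer: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Return (decision_source, route_id).

    Uses the ML scorer when it works. If the scorer raises, transactional
    messages fall back to the static route; promotional messages are
    deferred (an assumed policy, not one stated in the prompt).
    """
    try:
        return ("ml", ml_scorer(country))
    except Exception:
        if is_transactional:
            return ("static", STATIC_DEFAULT_ROUTE.get(country))
        return ("deferred", None)


def healthy_scorer(country: str) -> str:
    # Hypothetical stand-in for a model-serving call.
    return f"{country.lower()}-ml-best"


def broken_scorer(country: str) -> str:
    # Simulates the ML service being unavailable.
    raise RuntimeError("model server unavailable")


ok = route_with_fallback("US", True, healthy_scorer)
degraded_txn = route_with_fallback("US", True, broken_scorer)
degraded_promo = route_with_fallback("US", False, broken_scorer)
```

Returning the decision source alongside the route makes it straightforward to monitor how much traffic is being served by the fallback path.
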