## Product Context
PulseChat is a business messaging platform used by marketing and support teams to send campaigns, transactional messages, and in-app notifications. Product managers want a real-time analytics dashboard that predicts and surfaces message engagement trends within minutes of send time so operators can react before a campaign underperforms.
## Scale
| Signal | Value |
|---|---|
| DAU | 35M recipients, 120K sender-side operators |
| Messages sent/day | 1.8B |
| Peak ingest QPS | 220K events/sec |
| Active campaigns/day | 4.5M |
| Dashboard read QPS | 18K peak |
| Entities scored | message, campaign, audience segment, channel |
| Freshness target | < 2 minutes from event to dashboard |
| p99 dashboard query latency | < 300ms |
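A quick back-of-envelope check ties these numbers together. The derived figures below (event fan-out, payload size, events per message) are illustrative assumptions, not part of the spec:

```python
# Back-of-envelope check of the scale table above. All derived
# figures (fan-out, bytes/event, events/message) are assumptions.

SECONDS_PER_DAY = 86_400

messages_per_day = 1.8e9
avg_send_rate = messages_per_day / SECONDS_PER_DAY  # ~20.8K sends/sec

# Peak ingest is 220K events/sec: each send fans out into delivery,
# open, click, and reply events, so event volume exceeds send volume.
peak_ingest_qps = 220_000
fanout = peak_ingest_qps / avg_send_rate  # ~10x events per send at peak

# Raw event storage, assuming ~4 tracked events per message and a
# ~200-byte event payload (both hypothetical).
events_per_day = messages_per_day * 4
bytes_per_event = 200
raw_tb_per_day = events_per_day * bytes_per_event / 1e12  # ~1.4 TB/day

print(f"avg send rate: {avg_send_rate:,.0f}/sec")
print(f"peak fanout factor: {fanout:.1f}x")
print(f"raw event storage: {raw_tb_per_day:.2f} TB/day")
```

Even under these rough assumptions, raw events alone land in the low terabytes per day, which is why the drill-down constraint below matters for storage design.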
## Task
Design an end-to-end ML system that powers a real-time dashboard for message engagement. The dashboard should estimate near-term outcomes such as open rate, click-through rate, reply rate, and anomaly risk for campaigns that are still in flight.
Address the following:
- Clarify the product requirements, prediction targets, and user workflows for operators using the dashboard.
- Estimate system scale and propose an architecture for data ingestion, feature computation, model training, and online serving.
- Design a multi-stage ML system for candidate retrieval of relevant slices/alerts, ranking of the most important insights, and optional re-ranking for diversity, deduplication, or business constraints.
- Choose models for each stage and explain online vs batch inference decisions, feature freshness strategy, and storage choices.
- Define offline evaluation, online experimentation, and monitoring for drift, training-serving skew, and delayed labels.
- Identify major failure modes and how the system should degrade safely.
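The retrieval → ranking → re-ranking flow asked for above can be sketched on toy data. Everything here (field names, thresholds, the importance score, the per-campaign cap) is an illustrative assumption, not a prescribed design:

```python
# Minimal sketch of the three-stage insight pipeline: cheap retrieval,
# importance ranking, then re-ranking for deduplication. All names,
# thresholds, and scores below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Candidate:
    campaign_id: str
    slice: str            # e.g. "region=EU/device=ios"
    anomaly_score: float  # output of a cheap detector, in [0, 1]
    impact: float         # e.g. number of messages affected

def retrieve(candidates, threshold=0.5):
    """Stage 1: cheap filter -- keep slices whose lightweight
    anomaly score clears a threshold."""
    return [c for c in candidates if c.anomaly_score >= threshold]

def rank(candidates):
    """Stage 2: order survivors by an importance score (here a toy
    product of anomaly score and impact)."""
    return sorted(candidates,
                  key=lambda c: c.anomaly_score * c.impact,
                  reverse=True)

def rerank(candidates, per_campaign_cap=1):
    """Stage 3: diversify -- cap alerts per campaign so one noisy
    campaign cannot crowd out the rest of the dashboard."""
    seen, out = {}, []
    for c in candidates:
        if seen.get(c.campaign_id, 0) < per_campaign_cap:
            out.append(c)
            seen[c.campaign_id] = seen.get(c.campaign_id, 0) + 1
    return out
```

The same skeleton holds regardless of what models back each stage: stage 1 must be cheap enough to run over millions of slices, while stages 2 and 3 only see the short list that survives it.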
## Constraints
- Operators expect campaign-level metrics to update within 2 minutes, but final labels like conversions may arrive hours later.
- The system must support drill-down by region, device, audience segment, and message template without exploding storage cost.
- Some customers require regional data residency and cannot mix EU and US user-level data.
- Cost matters: GPU-heavy serving is discouraged for the dashboard read path; most inference should run on CPU or precompute.
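One way to honor the drill-down constraint without materializing the full region × device × segment × template cross-product is to pre-aggregate only a bounded set of rollup keys. The rollup choices and field names below are illustrative assumptions, a sketch rather than a prescribed schema:

```python
# Sketch of bounded pre-aggregation for drill-down: maintain counters
# only for a fixed list of rollup keys instead of the full dimension
# cross-product. Rollup choices and field names are assumptions.

from collections import defaultdict

# Rollups we choose to materialize; anything not covered here is
# answered by an offline scan, not on the dashboard read path.
ROLLUPS = [
    ("campaign",),
    ("campaign", "region"),
    ("campaign", "device"),
    ("campaign", "segment"),
    ("campaign", "template"),
]

# rollup_key -> {event_type: count}
counters = defaultdict(lambda: defaultdict(int))

def ingest(event):
    """Fold one engagement event into every materialized rollup."""
    for dims in ROLLUPS:
        key = tuple((d, event[d]) for d in dims)
        counters[key][event["type"]] += 1

def open_rate(key):
    """Read-path metric: opens over deliveries for one rollup key."""
    c = counters[key]
    return c["open"] / c["delivered"] if c["delivered"] else 0.0
```

With five rollups per campaign instead of a four-way cross-product, storage grows linearly in the number of dimension values per campaign rather than multiplicatively, and reads stay CPU-only lookups.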