## Product Context
PulseChat is a business messaging platform used by marketing and support teams to send campaigns, transactional messages, and in-app notifications. Product managers want a real-time analytics dashboard that predicts and surfaces message engagement trends within minutes of send time so operators can react before a campaign underperforms.
## Scale
| Signal | Value |
|---|---|
| DAU | 35M recipients, 120K sender-side operators |
| Messages sent/day | 1.8B |
| Peak ingest QPS | 220K events/sec |
| Active campaigns/day | 4.5M |
| Dashboard read QPS | 18K peak |
| Entities scored | message, campaign, audience segment, channel |
| Freshness target | < 2 minutes from event to dashboard |
| p99 dashboard query latency | < 300ms |
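A quick back-of-envelope check ties these numbers together. The derived figures below (event fan-out, payload size, events per message) are illustrative assumptions, not part of the spec:

```python
# Back-of-envelope check of the scale table above. All derived
# figures (fan-out, bytes/event, events/message) are assumptions.

SECONDS_PER_DAY = 86_400

messages_per_day = 1.8e9
avg_send_rate = messages_per_day / SECONDS_PER_DAY  # ~20.8K sends/sec

# Peak ingest is 220K events/sec: each send fans out into delivery,
# open, click, and reply events, so event volume exceeds send volume.
peak_ingest_qps = 220_000
fanout = peak_ingest_qps / avg_send_rate  # ~10x events per send at peak

# Raw event storage, assuming ~4 tracked events per message and a
# ~200-byte event payload (both hypothetical).
events_per_day = messages_per_day * 4
bytes_per_event = 200
raw_tb_per_day = events_per_day * bytes_per_event / 1e12  # ~1.4 TB/day

print(f"avg send rate: {avg_send_rate:,.0f}/sec")
print(f"peak fanout factor: {fanout:.1f}x")
print(f"raw event storage: {raw_tb_per_day:.2f} TB/day")
```

Even under these rough assumptions, raw events alone land in the low terabytes per day, which is why the drill-down constraint below matters for storage design.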
## Task
Design an end-to-end ML system that powers a real-time dashboard for message engagement. The dashboard should estimate near-term outcomes such as open rate, click-through rate, reply rate, and anomaly risk for campaigns that are still in flight.
Address the following:
- Clarify the product requirements, prediction targets, and user workflows for operators using the dashboard.
- Estimate system scale and propose an architecture for data ingestion, feature computation, model training, and online serving.
- Design a multi-stage ML system for candidate retrieval of relevant slices/alerts, ranking of the most important insights, and optional re-ranking for diversity, deduplication, or business constraints.
- Choose models for each stage and explain online vs batch inference decisions, feature freshness strategy, and storage choices.
- Define offline evaluation, online experimentation, and monitoring for drift, training-serving skew, and delayed labels.
- Identify major failure modes and how the system should degrade safely.
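The retrieval → ranking → re-ranking flow asked for above can be sketched on toy data. Everything here (field names, thresholds, the importance score, the per-campaign cap) is an illustrative assumption, not a prescribed design:

```python
# Minimal sketch of the three-stage insight pipeline: cheap retrieval,
# importance ranking, then re-ranking for deduplication. All names,
# thresholds, and scores below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Candidate:
    campaign_id: str
    slice: str            # e.g. "region=EU/device=ios"
    anomaly_score: float  # output of a cheap detector, in [0, 1]
    impact: float         # e.g. number of messages affected

def retrieve(candidates, threshold=0.5):
    """Stage 1: cheap filter -- keep slices whose lightweight
    anomaly score clears a threshold."""
    return [c for c in candidates if c.anomaly_score >= threshold]

def rank(candidates):
    """Stage 2: order survivors by an importance score (here a toy
    product of anomaly score and impact)."""
    return sorted(candidates,
                  key=lambda c: c.anomaly_score * c.impact,
                  reverse=True)

def rerank(candidates, per_campaign_cap=1):
    """Stage 3: diversify -- cap alerts per campaign so one noisy
    campaign cannot crowd out the rest of the dashboard."""
    seen, out = {}, []
    for c in candidates:
        if seen.get(c.campaign_id, 0) < per_campaign_cap:
            out.append(c)
            seen[c.campaign_id] = seen.get(c.campaign_id, 0) + 1
    return out
```

The same skeleton holds regardless of what models back each stage: stage 1 must be cheap enough to run over millions of slices, while stages 2 and 3 only see the short list that survives it.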
## Constraints
- Operators expect campaign-level metrics to update within 2 minutes, but final labels like conversions may arrive hours later.
- The system must support drill-down by region, device, audience segment, and message template without exploding storage cost.
- Some customers require regional data residency and cannot mix EU and US user-level data.
- Cost matters: GPU-heavy serving is discouraged for the dashboard read path; most inference should run on CPU or precompute.
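One way to honor the drill-down constraint without materializing the full region × device × segment × template cross-product is to pre-aggregate only a bounded set of rollup keys. The rollup choices and field names below are illustrative assumptions, a sketch rather than a prescribed schema:

```python
# Sketch of bounded pre-aggregation for drill-down: maintain counters
# only for a fixed list of rollup keys instead of the full dimension
# cross-product. Rollup choices and field names are assumptions.

from collections import defaultdict

# Rollups we choose to materialize; anything not covered here is
# answered by an offline scan, not on the dashboard read path.
ROLLUPS = [
    ("campaign",),
    ("campaign", "region"),
    ("campaign", "device"),
    ("campaign", "segment"),
    ("campaign", "template"),
]

# rollup_key -> {event_type: count}
counters = defaultdict(lambda: defaultdict(int))

def ingest(event):
    """Fold one engagement event into every materialized rollup."""
    for dims in ROLLUPS:
        key = tuple((d, event[d]) for d in dims)
        counters[key][event["type"]] += 1

def open_rate(key):
    """Read-path metric: opens over deliveries for one rollup key."""
    c = counters[key]
    return c["open"] / c["delivered"] if c["delivered"] else 0.0
```

With five rollups per campaign instead of a four-way cross-product, storage grows linearly in the number of dimension values per campaign rather than multiplicatively, and reads stay CPU-only lookups.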