Product Context
Meta's Network Operations Center relies on an internal dashboard to monitor the health of global backbone traffic across inter-region links, points of presence (POPs), and data centers. Design an ML-driven system that detects, prioritizes, and surfaces likely network incidents in real time so that SREs and network engineers can triage the highest-impact issues first.
Scale
| Signal | Value |
|---|---|
| Backbone devices and links | ~120K devices, ~350K logical links |
| Telemetry streams | ~15M metrics/sec globally |
| Peak alert-evaluation QPS | ~250K entity evaluations/sec |
| Historical training data | ~18 months of logs and incidents |
| Candidate anomalies per minute | ~500K raw anomalies before dedup |
| Dashboard refresh target | every 5 seconds |
| End-to-end detection latency budget | p99 < 10 seconds from metric arrival |
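A quick back-of-envelope from the table above shows why a cheap first stage matters: raw candidate anomalies arrive at only a tiny fraction of the metric-sample rate, so most compute should go into filtering and ranking that small stream, not into scoring every sample. A minimal sketch of the arithmetic (figures taken directly from the table):

```python
# Back-of-envelope throughput figures from the Scale table (illustrative only).
METRICS_PER_SEC = 15_000_000        # global telemetry ingest
RAW_ANOMALIES_PER_MIN = 500_000     # candidate anomalies before dedup
EVALS_PER_SEC = 250_000             # peak alert-evaluation QPS

anomalies_per_sec = RAW_ANOMALIES_PER_MIN / 60       # ~8,333 candidates/sec
anomaly_rate = anomalies_per_sec / METRICS_PER_SEC   # ~0.06% of samples
headroom = EVALS_PER_SEC / anomalies_per_sec         # ~30x evaluation headroom

print(f"{anomalies_per_sec:,.0f} candidates/sec "
      f"({anomaly_rate:.4%} of samples), ~{headroom:.0f}x eval headroom")
```

The ~30x gap between candidate volume and peak evaluation capacity is what makes room for a heavier ranking stage downstream.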
Task
- Clarify the product goals, users, and what “health” means for Meta's global backbone network.
- Design the end-to-end ML system, including telemetry ingestion, candidate anomaly generation, ranking, and dashboard presentation.
- Choose models for each stage and explain why a multi-stage pipeline is better than a single detector.
- Define the online vs. batch architecture, feature storage, retraining cadence, and feedback loop from incidents back into training.
- Propose offline and online evaluation, including how you would validate usefulness for on-call engineers.
- Identify major failure modes such as feature drift, training-serving skew, missing telemetry, and alert storms, and explain mitigations.
Constraints
- False negatives are costly because missed backbone incidents can impact multiple Meta surfaces simultaneously.
- False positives are also expensive because noisy alerts burn operator attention and increase mean time to resolution (MTTR).
- Some labels are delayed or weak: incident tickets may be created minutes after the first symptom, and many anomalies never become tickets.
- The system must operate across regions with partial data loss, clock skew, and heterogeneous device vendors.
- Cost matters: the online path should favor CPU-first inference, with heavier models limited to later stages or batch analysis.
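The delayed/weak-label constraint above shapes how training data is built. One common treatment, sketched here under the assumption that an anomaly counts as a positive if a ticket on the same entity opens within a tolerance window after the anomaly starts, is to leave unmatched anomalies *unlabeled* rather than treating them as hard negatives (the helper names and 15-minute window are illustrative):

```python
# Hedged sketch of weak-label construction under delayed incident tickets.
# Assumption: an anomaly is a positive if a ticket on the same entity is
# opened within MATCH_WINDOW after the anomaly's start; everything else is
# left unlabeled, never a hard negative, since many real incidents never
# get a ticket.
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(minutes=15)  # tolerance for late ticket creation

def weak_label(anomaly: dict, tickets: list[dict]) -> str:
    """Return 'pos' on a ticket match within the window, else 'unlabeled'."""
    for t in tickets:
        delay = t["opened_at"] - anomaly["start"]
        if t["entity"] == anomaly["entity"] and timedelta(0) <= delay <= MATCH_WINDOW:
            return "pos"
    return "unlabeled"

# Toy usage: a ticket filed 7 minutes after the first symptom still matches.
anomaly = {"entity": "link-9", "start": datetime(2024, 1, 1, 12, 0)}
tickets = [{"entity": "link-9", "opened_at": datetime(2024, 1, 1, 12, 7)}]
print(weak_label(anomaly, tickets))  # pos
```

Downstream, the unlabeled pool can be handled with positive-unlabeled learning or down-weighted sampling rather than being folded into the negative class.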