Product Context
Meta's Network Operations Center relies on an internal dashboard to monitor the health of global backbone traffic across inter-region links, points of presence (POPs), and data centers. Design an ML-driven system that detects, prioritizes, and surfaces likely network incidents in real time so that SREs and network engineers can triage the highest-impact issues first.
Scale
| Signal | Value |
|---|---|
| Backbone devices and links | ~120K devices, ~350K logical links |
| Telemetry streams | ~15M metrics/sec globally |
| Peak alert-evaluation QPS | ~250K entity evaluations/sec |
| Historical training data | ~18 months of logs and incidents |
| Candidate anomalies per minute | ~500K raw anomalies before dedup |
| Dashboard refresh target | every 5 seconds |
| End-to-end detection latency budget | p99 < 10 seconds from metric arrival |
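A quick back-of-envelope from the table above shows why a cheap first stage matters: raw candidate anomalies arrive at only a tiny fraction of the metric-sample rate, so most compute should go into filtering and ranking that small stream, not into scoring every sample. A minimal sketch of the arithmetic (figures taken directly from the table):

```python
# Back-of-envelope throughput figures from the Scale table (illustrative only).
METRICS_PER_SEC = 15_000_000        # global telemetry ingest
RAW_ANOMALIES_PER_MIN = 500_000     # candidate anomalies before dedup
EVALS_PER_SEC = 250_000             # peak alert-evaluation QPS

anomalies_per_sec = RAW_ANOMALIES_PER_MIN / 60       # ~8,333 candidates/sec
anomaly_rate = anomalies_per_sec / METRICS_PER_SEC   # ~0.06% of samples
headroom = EVALS_PER_SEC / anomalies_per_sec         # ~30x evaluation headroom

print(f"{anomalies_per_sec:,.0f} candidates/sec "
      f"({anomaly_rate:.4%} of samples), ~{headroom:.0f}x eval headroom")
```

The ~30x gap between candidate volume and peak evaluation capacity is what makes room for a heavier ranking stage downstream.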
Task
- Clarify the product goals, users, and what “health” means for Meta's global backbone network.
- Design the end-to-end ML system, including telemetry ingestion, candidate anomaly generation, ranking, and dashboard presentation.
- Choose models for each stage and explain why a multi-stage pipeline is better than a single detector.
- Define the online vs. batch architecture, feature storage, retraining cadence, and feedback loop from incidents back into training.
- Propose offline and online evaluation, including how you would validate usefulness for on-call engineers.
- Identify major failure modes such as feature drift, training-serving skew, missing telemetry, and alert storms, and explain mitigations.
Constraints
- False negatives are costly because missed backbone incidents can impact multiple Meta surfaces simultaneously.
- False positives are also expensive because noisy alerts burn operator attention and increase mean time to resolution (MTTR).
- Some labels are delayed or weak: incident tickets may be created minutes after the first symptom, and many anomalies never become tickets.
- The system must operate across regions with partial data loss, clock skew, and heterogeneous device vendors.
- Cost matters: the online path should favor CPU-first inference, with heavier models limited to later stages or batch analysis.
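The delayed/weak-label constraint above shapes how training data is built. One common treatment, sketched here under the assumption that an anomaly counts as a positive if a ticket on the same entity opens within a tolerance window after the anomaly starts, is to leave unmatched anomalies *unlabeled* rather than treating them as hard negatives (the helper names and 15-minute window are illustrative):

```python
# Hedged sketch of weak-label construction under delayed incident tickets.
# Assumption: an anomaly is a positive if a ticket on the same entity is
# opened within MATCH_WINDOW after the anomaly's start; everything else is
# left unlabeled, never a hard negative, since many real incidents never
# get a ticket.
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(minutes=15)  # tolerance for late ticket creation

def weak_label(anomaly: dict, tickets: list[dict]) -> str:
    """Return 'pos' on a ticket match within the window, else 'unlabeled'."""
    for t in tickets:
        delay = t["opened_at"] - anomaly["start"]
        if t["entity"] == anomaly["entity"] and timedelta(0) <= delay <= MATCH_WINDOW:
            return "pos"
    return "unlabeled"

# Toy usage: a ticket filed 7 minutes after the first symptom still matches.
anomaly = {"entity": "link-9", "start": datetime(2024, 1, 1, 12, 0)}
tickets = [{"entity": "link-9", "opened_at": datetime(2024, 1, 1, 12, 7)}]
print(weak_label(anomaly, tickets))  # pos
```

Downstream, the unlabeled pool can be handled with positive-unlabeled learning or down-weighted sampling rather than being folded into the negative class.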