Product Context
Meta operates a large global network spanning datacenters, backbone links, and edge POPs that support Facebook, Instagram, WhatsApp, and internal infrastructure. Design an ML system that detects network anomalies, triages likely root causes, and recommends or triggers directed repair actions for SiteOps and network automation systems.
Scale
| Signal | Value |
|---|---|
| Network devices monitored | 2.5M routers, switches, optics, and hosts |
| Telemetry events/day | 45B counters, logs, alerts, and flow summaries |
| Peak telemetry ingest | 1.2M events/sec |
| Concurrent incidents/day | 25K anomaly clusters |
| Candidate repair actions | 5K runbooks / automation actions |
| Online latency budget | 3s p99 from anomaly trigger to ranked repair plan |
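
As a rough sanity check on the figures in the table, the short calculation below derives the average ingest rate and compares it with the stated peak. It is a back-of-the-envelope sketch only: it assumes events are spread uniformly over an 86,400-second day, and the variable names are illustrative rather than part of the problem statement.

```python
# Back-of-the-envelope check of the scale figures in the table above.
# Assumption: a uniform 24-hour day; real traffic is burstier.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 45e9          # telemetry events/day
peak_events_per_sec = 1.2e6    # peak telemetry ingest
devices = 2.5e6                # network devices monitored
incidents_per_day = 25_000     # anomaly clusters/day

avg_events_per_sec = events_per_day / SECONDS_PER_DAY
peak_to_avg = peak_events_per_sec / avg_events_per_sec
events_per_device_per_day = events_per_day / devices
avg_incidents_per_min = incidents_per_day / (24 * 60)

print(f"average ingest: {avg_events_per_sec:,.0f} events/sec")
print(f"peak / average: {peak_to_avg:.2f}x")
print(f"events per device per day: {events_per_device_per_day:,.0f}")
print(f"anomaly clusters per minute: {avg_incidents_per_min:.1f}")
```

Under that assumption the average is roughly 520K events/sec, so the stated peak is a little over twice the average, and each device contributes about 18K events per day.
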
Task
- Clarify the functional goals across detection, incident clustering, root-cause triage, and repair recommendation.
- Design an end-to-end multi-stage ML architecture, including retrieval, ranking, and re-ranking of candidate causes and repair actions.
- Specify the offline and online data pipelines, feature store design, and training cadence for fast-changing network conditions.
- Choose models for each stage and explain tradeoffs in accuracy, latency, interpretability, and operator trust.
- Define evaluation strategy, including offline metrics, online rollout, human-in-the-loop validation, and guardrails.
- Identify failure modes such as feature drift, training-serving skew, bad automated actions, and telemetry outages, with detection and mitigation plans.
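
To make the retrieval, ranking, and re-ranking stages in the task list concrete, the sketch below shows one possible shape for the candidate funnel. It is not a prescribed design: the function `funnel`, the three scorer callables, and the cut-off sizes `k_retrieve`, `k_rank`, and `k_final` are all hypothetical placeholders for whatever models and budgets a candidate proposes.

```python
from typing import Callable, List

# A scorer maps (incident, action_id) to a relevance score; higher is better.
Scorer = Callable[[dict, str], float]


def funnel(
    incident: dict,
    candidates: List[str],
    retrieve_score: Scorer,   # cheap, e.g. embedding similarity (assumption)
    rank_score: Scorer,       # heavier model over a shortlist (assumption)
    rerank_score: Scorer,     # policy/risk-aware final ordering (assumption)
    k_retrieve: int = 200,
    k_rank: int = 20,
    k_final: int = 5,
) -> List[str]:
    """Narrow a large action catalog to a short ranked repair plan."""

    def top_k(items: List[str], scorer: Scorer, k: int) -> List[str]:
        # Sort descending by score and keep the best k items.
        return sorted(items, key=lambda a: scorer(incident, a), reverse=True)[:k]

    shortlist = top_k(candidates, retrieve_score, k_retrieve)
    ranked = top_k(shortlist, rank_score, k_rank)
    return top_k(ranked, rerank_score, k_final)
```

Each stage sees fewer candidates than the one before it, so the more expensive scorers run on progressively smaller sets, which is the usual way to fit a multi-stage design into a fixed latency budget.
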
Constraints
- False positives are expensive: unnecessary repairs can impact production traffic.
- Some actions may be auto-executed only for low-risk classes; others require human approval in internal Meta tooling.
- Telemetry can be delayed, missing, or inconsistent across regions.
- The system must remain useful during partial outages when some feature sources are unavailable.
- Explanations must be available for triage recommendations to support QA, SiteOps, and incident review.
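
The constraints above can be read as a gating policy on each recommended action. The following is a minimal sketch of such a gate, under stated assumptions: the risk labels (`"low"`, `"medium"`, `"high"`), the thresholds, and the names `RepairCandidate`, `Decision`, and `gate` are invented for illustration and do not describe any actual Meta tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_APPROVAL = "needs_approval"
    RECOMMEND_ONLY = "recommend_only"


@dataclass
class RepairCandidate:
    action_id: str
    risk_class: str                 # hypothetical labels: "low", "medium", "high"
    confidence: float               # model score in [0, 1]
    missing_sources: List[str] = field(default_factory=list)
    explanation: str = ""


def gate(c: RepairCandidate,
         auto_threshold: float = 0.95,      # illustrative value
         approve_threshold: float = 0.60    # illustrative value
         ) -> Decision:
    # Weak evidence: surface the suggestion but do not queue it for action.
    if c.confidence < approve_threshold:
        return Decision.RECOMMEND_ONLY
    # Auto-execute only low-risk actions with high confidence, complete
    # feature coverage, and an explanation available for later review.
    if (c.risk_class == "low"
            and c.confidence >= auto_threshold
            and not c.missing_sources
            and c.explanation):
        return Decision.AUTO_EXECUTE
    # Everything else goes to a human for approval.
    return Decision.NEEDS_APPROVAL
```

In this sketch, missing feature sources never block a recommendation; they only downgrade it from automatic execution to human approval, which keeps the system useful during partial telemetry outages while limiting the cost of false positives.
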