Product Context
Meta operates a large global backbone and datacenter network that supports Facebook, Instagram, WhatsApp, Messenger, and internal services. Design an ML-driven validation system that scores proposed network changes before rollout in NetNORAD-like change management workflows, so network engineers can catch risky changes early and reduce incidents.
Scale
| Signal | Value |
|---|---|
| Network changes submitted/day | 1.8M |
| Peak validation QPS | 3,500 change requests/sec during rollout windows |
| Devices / endpoints in scope | 12M routers, switches, load balancers, hosts |
| Historical change records | 4B over 3 years |
| Topology / config graph size | 40B edges across regions and services |
| End-to-end scoring latency budget (p99) | 250ms per change |
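
A quick back-of-envelope check on the table can help size the serving tier. The sketch below only uses the figures stated above. The function name `capacity_summary`, the 365-day year, and the use of the p99 latency as a conservative stand-in for mean latency in Little's law are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def capacity_summary(
    changes_per_day=1_800_000,
    peak_qps=3_500,
    historical_records=4_000_000_000,
    history_years=3,
    p99_latency_s=0.250,
):
    """Derive rough sizing numbers from the Scale table (illustrative only)."""
    avg_qps = changes_per_day / SECONDS_PER_DAY
    return {
        # Average request rate if submissions were spread evenly over a day.
        "avg_qps": avg_qps,
        # How bursty rollout windows are relative to the daily average.
        "peak_to_avg_ratio": peak_qps / avg_qps,
        # Daily record rate implied by the historical corpus (365-day years).
        "implied_records_per_day": historical_records / (history_years * 365),
        # Little's law (L = lambda * W) with p99 used as a pessimistic W.
        "peak_in_flight_upper_bound": peak_qps * p99_latency_s,
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the average rate is roughly 21 requests/sec against a 3,500 requests/sec peak, so capacity is driven by rollout windows rather than daily volume. The historical corpus implies about 3.65M records/day, roughly double the stated submission rate, which is worth clarifying when scoping the training data.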
Task
- Clarify the product goal, prediction target, and what “validate before rollout” means operationally.
- Design an end-to-end ML system, including candidate retrieval, ranking, and final re-ranking / policy gating for change approval.
- Specify the offline and online data pipelines, labels, feature store design, and how you avoid training-serving skew.
- Choose models for each stage and justify tradeoffs across accuracy, latency, interpretability, and cost.
- Define offline evaluation, online rollout strategy, and guardrails for safe deployment.
- Identify key failure modes, including feature drift, stale topology data, and false negatives that allow bad changes through.
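
For the feature drift failure mode listed above, one common monitoring technique is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against its distribution at serving time. The sketch below is a minimal, standard-library version. The function name `population_stability_index`, the quantile binning, the `eps` smoothing, and the 0.2 alert level in the usage note are assumptions, not requirements from the prompt.

```python
import math
from bisect import bisect_right


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    # Bin edges are quantiles of the reference (training-time) sample.
    ref = sorted(expected)
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    train = list(range(1000))
    live_same = list(range(1000))
    live_shifted = [v + 500 for v in range(1000)]
    print(population_stability_index(train, live_same))
    print(population_stability_index(train, live_shifted))
```

A frequently quoted rule of thumb treats PSI above roughly 0.2 as a significant shift. That threshold is a convention rather than something derived from this problem, so it would need tuning per feature.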
Constraints
- The system must support both synchronous pre-submit validation and asynchronous deeper analysis after submission.
- Many labels are delayed: incidents may be attributed minutes to hours after a change, and attribution can be noisy.
- The final decision must remain human-auditable; engineers need explanations and top risk factors.
- Some regions have stricter compliance requirements, so raw configs may need redaction or feature hashing before model training.
- False negatives are much more expensive than false positives, but too many false positives will block engineer productivity and cause alert fatigue.
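
The last constraint can be made concrete as a cost-sensitive threshold search: pick the risk-score cutoff that minimizes total misclassification cost on labeled history, subject to a cap on how many changes get flagged. This is a minimal sketch. The function name `pick_threshold`, the 50:1 cost ratio, the `max_flag_rate` cap, and the toy scores and labels are all hypothetical values chosen for illustration.

```python
def pick_threshold(scores, labels, fn_cost, fp_cost, max_flag_rate):
    """Return (threshold, cost) minimizing fn_cost*FN + fp_cost*FP.

    A change is flagged when score >= threshold. Thresholds that flag more
    than max_flag_rate of all changes are rejected to limit alert fatigue.
    """
    n = len(scores)
    # Candidate cutoffs: every observed score, plus "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        flagged = [s >= t for s in scores]
        if sum(flagged) / n > max_flag_rate:
            continue
        fn = sum(1 for f, y in zip(flagged, labels) if y == 1 and not f)
        fp = sum(1 for f, y in zip(flagged, labels) if y == 0 and f)
        cost = fn * fn_cost + fp * fp_cost
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


if __name__ == "__main__":
    # Toy data: label 1 means the change was later attributed to an incident.
    scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90, 0.95]
    labels = [0, 0, 0, 1, 0, 1, 0, 1]
    print(pick_threshold(scores, labels, fn_cost=50, fp_cost=1, max_flag_rate=0.75))
```

Because the labels are delayed and noisy, a search like this would only be meaningful on changes old enough for incident attribution to have settled, and the chosen cutoff should be rechecked as attribution is revised.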