
Design ML Network Repair Pipeline

Hard

ML System Design

Product Context

Meta operates a large global network spanning datacenters, backbone links, and edge POPs that support Facebook, Instagram, WhatsApp, and internal infrastructure. Design an ML system that detects network anomalies, triages likely root causes, and recommends or triggers directed repair actions for SiteOps and network automation systems.

Scale

Signal	Value
Network devices monitored	2.5M routers, switches, optics, and hosts
Telemetry events/day	45B counters, logs, alerts, and flow summaries
Peak telemetry ingest	1.2M events/sec
Concurrent incidents/day	25K anomaly clusters
Candidate repair actions	5K runbooks / automation actions
Online latency budget	3s p99 from anomaly trigger to ranked repair plan
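
For sizing, the table's figures imply some rough rates that are worth having in hand before choosing a design. The following is a back-of-the-envelope sketch only: the variable names are illustrative, the inputs are the table values, and it assumes load is spread evenly over a day, which real traffic will not be.

```python
# Back-of-the-envelope rates derived only from the Scale table above.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 45e9          # telemetry events/day
peak_events_per_sec = 1.2e6    # peak telemetry ingest
devices = 2.5e6                # network devices monitored
incidents_per_day = 25_000     # anomaly clusters per day

# Average ingest if events were spread uniformly over the day (an assumption).
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# How far the stated peak sits above that uniform average.
peak_to_avg = peak_events_per_sec / avg_events_per_sec

# Average telemetry volume per monitored device.
events_per_device_per_day = events_per_day / devices

# Average gap between new anomaly clusters, again assuming uniform arrival.
seconds_between_incidents = SECONDS_PER_DAY / incidents_per_day

print(f"avg ingest: {avg_events_per_sec:,.0f} events/sec")
print(f"peak/avg ratio: {peak_to_avg:.2f}x")
print(f"events per device per day: {events_per_device_per_day:,.0f}")
print(f"one new anomaly cluster every {seconds_between_incidents:.2f}s on average")
```

Under these assumptions the average ingest is roughly 520K events/sec, so the stated peak is about 2.3x the average, and a new anomaly cluster arrives about every 3.5 seconds, close to the 3s p99 latency budget for producing a ranked repair plan.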

Task

Clarify the functional goals across detection, incident clustering, root-cause triage, and repair recommendation.
Design an end-to-end multi-stage ML architecture, including retrieval, ranking, and re-ranking of candidate causes and repair actions.
Specify the offline and online data pipelines, feature store design, and training cadence for fast-changing network conditions.
Choose models for each stage and explain tradeoffs in accuracy, latency, interpretability, and operator trust.
Define evaluation strategy, including offline metrics, online rollout, human-in-the-loop validation, and guardrails.
Identify failure modes such as feature drift, training-serving skew, bad automated actions, and telemetry outages, with detection and mitigation plans.

Constraints

False positives are expensive: unnecessary repairs can impact production traffic.
Some actions may be auto-executed only for low-risk classes; others require human approval in internal Meta tooling.
Telemetry can be delayed, missing, or inconsistent across regions.
The system must remain useful during partial outages when some feature sources are unavailable.
Explanations must be available for triage recommendations to support QA, SiteOps, and incident review.
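
One way to read the constraints above is as a gating policy that sits between the ranked repair plan and execution. The sketch below is a minimal illustration, not a prescribed design: the risk-class labels, the `feature_coverage` field, the threshold values, and all names are hypothetical assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"      # safe to hand to network automation
    NEEDS_APPROVAL = "needs_approval"  # route to a human operator
    SUPPRESS = "suppress"              # too uncertain to surface at all


@dataclass(frozen=True)
class Recommendation:
    action_id: str
    risk_class: str          # hypothetical labels: "low", "medium", "high"
    confidence: float        # model score in [0, 1]
    feature_coverage: float  # fraction of expected feature sources present
    explanation: str         # human-readable rationale for the triage result


def gate(rec: Recommendation,
         auto_conf: float = 0.95,
         min_conf: float = 0.5,
         min_coverage: float = 0.8) -> Decision:
    """Decide how a recommended repair action may proceed.

    All thresholds are illustrative assumptions, not values from the prompt.
    """
    # False positives are expensive, so weak recommendations are dropped.
    if rec.confidence < min_conf:
        return Decision.SUPPRESS

    # Auto-execution is reserved for low-risk actions that are confidently
    # scored, backed by enough telemetry, and come with an explanation.
    can_auto = (
        rec.risk_class == "low"
        and rec.confidence >= auto_conf
        and rec.feature_coverage >= min_coverage
        and bool(rec.explanation)
    )
    if can_auto:
        return Decision.AUTO_EXECUTE

    # Everything else stays useful as a suggestion but needs a human.
    return Decision.NEEDS_APPROVAL


if __name__ == "__main__":
    rec = Recommendation("reseat_optic", "low", 0.97, 1.0, "CRC errors rising")
    print(gate(rec))
```

With this shape, a partial telemetry outage lowers `feature_coverage` and downgrades an otherwise auto-executable action to human approval instead of dropping it, which keeps the system useful while limiting the cost of a wrong automated repair.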



