Troubleshoot Global Network Rollout

Problem

Meta is rolling out a network configuration change across multiple regions that support internal services and edge traffic. Shortly after deployment, engineers observe intermittent packet loss, elevated latency, and uneven request distribution behind load balancers. The symptoms are inconsistent across regions: some regions are healthy, while others show failures on paths between routers, switches, and load balancers.

You are the QA Engineer responsible for validating the deployment and driving root-cause isolation.

Your Task

Describe how you would troubleshoot this issue in a global production-like environment.

Your answer should cover:

Triage strategy: how you determine blast radius, affected regions, affected paths, and whether the issue is control-plane, data-plane, or configuration related.
Layered debugging approach across routers, switches, and Meta load-balancing infrastructure (for example, edge and service traffic paths), including what telemetry, counters, logs, and health signals you would inspect first.
Verification of recent changes: routing policy, ACLs, VLANs, BGP/ECMP behavior, NAT, health checks, MTU, link state, failover behavior, and config drift.
Isolation methods: how you compare healthy vs unhealthy regions, reproduce safely, and distinguish device failure from bad rollout, dependency failure, or traffic asymmetry.
Security and reliability considerations: how you ensure the issue is not caused by an unintended policy block, segmentation error, or exposure created during mitigation.
Mitigation and rollback: what immediate actions you would take to restore service while preserving evidence for postmortem.
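
One way to make the healthy-versus-unhealthy comparison concrete is to diff configuration between the two groups and keep only the settings that separate them cleanly. The sketch below is illustrative only: the region names, the metric values, the config keys (`mtu`, `ecmp_hash`, `acl_rev`), and the loss threshold are all hypothetical and do not come from any real Meta tooling.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical per-region snapshot; every name and value is an assumption.
REGIONS: Dict[str, Dict[str, Any]] = {
    "region-a": {"loss_pct": 0.01, "config": {"mtu": 9000, "ecmp_hash": "5-tuple", "acl_rev": 41}},
    "region-b": {"loss_pct": 2.40, "config": {"mtu": 1500, "ecmp_hash": "5-tuple", "acl_rev": 42}},
    "region-c": {"loss_pct": 0.02, "config": {"mtu": 9000, "ecmp_hash": "5-tuple", "acl_rev": 41}},
    "region-d": {"loss_pct": 1.10, "config": {"mtu": 1500, "ecmp_hash": "5-tuple", "acl_rev": 41}},
}

LOSS_THRESHOLD_PCT = 0.5  # assumed cutoff for calling a region unhealthy


def split_by_health(regions: Dict[str, Dict[str, Any]], threshold: float) -> Tuple[Set[str], Set[str]]:
    """Return (healthy, unhealthy) region names based on packet loss."""
    unhealthy = {name for name, snap in regions.items() if snap["loss_pct"] > threshold}
    return set(regions) - unhealthy, unhealthy


def discriminating_keys(
    regions: Dict[str, Dict[str, Any]], healthy: Set[str], unhealthy: Set[str]
) -> Dict[str, Tuple[Set[Any], Set[Any]]]:
    """Config keys whose healthy-region values never appear in unhealthy regions."""
    all_keys: Set[str] = set()
    for snap in regions.values():
        all_keys |= set(snap["config"])

    result = {}
    for key in sorted(all_keys):
        healthy_vals = {regions[r]["config"].get(key) for r in healthy}
        unhealthy_vals = {regions[r]["config"].get(key) for r in unhealthy}
        # Keep the key only if the two groups share no value for it.
        if healthy_vals and unhealthy_vals and healthy_vals.isdisjoint(unhealthy_vals):
            result[key] = (healthy_vals, unhealthy_vals)
    return result


healthy, unhealthy = split_by_health(REGIONS, LOSS_THRESHOLD_PCT)
suspects = discriminating_keys(REGIONS, healthy, unhealthy)
print("unhealthy regions:", sorted(unhealthy))
print("config keys that separate the groups:", sorted(suspects))
```

In this made-up data, `acl_rev` differs in only one unhealthy region, so it is not reported as a clean separator, while `mtu` is. A separating key is a lead to verify on the devices, not a confirmed root cause.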

Deliverable

Provide a structured troubleshooting plan, including the order of investigation, key hypotheses, commands or checks you would run, and the criteria you would use to confirm root cause. Be explicit about how you would coordinate across regions and avoid making the incident worse during diagnosis.
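
As a rough illustration of what "order of investigation" and "avoid making the incident worse" could look like in a written plan, the sketch below records each hypothesis with its check and confirmation criterion, then orders them so read-only checks run before anything that changes device or traffic state. The hypotheses, the likelihood guesses, and the field names are assumptions chosen for the example, not a prescribed answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    name: str
    check: str        # what to inspect
    confirm_if: str   # evidence that would confirm the hypothesis
    intrusive: bool   # True if the check changes device or traffic state
    prior: float      # rough likelihood guess in [0, 1]; purely illustrative


PLAN: List[Hypothesis] = [
    Hypothesis(
        name="Device fault",
        check="drain one suspect device and watch loss on the path",
        confirm_if="loss stops when the device is drained and returns when it is restored",
        intrusive=True,
        prior=0.10,
    ),
    Hypothesis(
        name="MTU mismatch on changed links",
        check="compare interface MTU values; send do-not-fragment probes at several sizes",
        confirm_if="large probes fail while small probes pass, only on changed paths",
        intrusive=False,
        prior=0.30,
    ),
    Hypothesis(
        name="ACL or policy block",
        check="read ACL deny counters and diff policy revisions against the last known-good one",
        confirm_if="deny counters increase in step with failed flows",
        intrusive=False,
        prior=0.25,
    ),
    Hypothesis(
        name="ECMP imbalance",
        check="compare per-next-hop traffic counters across equal-cost paths",
        confirm_if="a subset of next hops carries most traffic or shows the drops",
        intrusive=False,
        prior=0.20,
    ),
]


def investigation_order(plan: List[Hypothesis]) -> List[Hypothesis]:
    """Read-only checks first; within each group, most likely hypothesis first."""
    return sorted(plan, key=lambda h: (h.intrusive, -h.prior))


ordered = investigation_order(PLAN)
for step, h in enumerate(ordered, start=1):
    kind = "intrusive" if h.intrusive else "read-only"
    print(f"{step}. [{kind}] {h.name}: {h.check}")
```

Sorting on `(intrusive, -prior)` keeps any state-changing check at the end of the list regardless of how likely its hypothesis seems, which is one simple way to encode the "do no harm during diagnosis" requirement.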
