Problem
Meta is rolling out a network configuration change across multiple regions supporting internal services and edge traffic. Shortly after deployment, engineers observe intermittent packet loss, elevated latency, and uneven request distribution behind load balancers. The issue appears inconsistent: some regions are healthy, while others show failures between routers, switches, and load balancers.
You are the QA Engineer responsible for validating the deployment and driving root-cause isolation.
Your Task
Describe how you would troubleshoot this issue in a global production-like environment.
Your answer should cover:
- Triage strategy: how you determine blast radius, affected regions, affected paths, and whether the issue is control-plane, data-plane, or configuration related.
- Layered debugging approach across routers, switches, and Meta load-balancing infrastructure (for example, edge and service traffic paths), including what telemetry, counters, logs, and health signals you would inspect first.
- Verification of recent changes: routing policy, ACLs, VLANs, BGP/ECMP behavior, NAT, health checks, MTU, link state, failover behavior, and config drift.
- Isolation methods: how you compare healthy vs unhealthy regions, reproduce safely, and distinguish device failure from bad rollout, dependency failure, or traffic asymmetry.
- Security and reliability considerations: how you ensure the issue is not caused by an unintended policy block, segmentation error, or exposure created during mitigation.
- Mitigation and rollback: what immediate actions you would take to restore service while preserving evidence for postmortem.
Deliverable
Provide a structured troubleshooting plan, including the order of investigation, key hypotheses, commands or checks you would run, and the criteria you would use to confirm root cause. Be explicit about how you would coordinate across regions and avoid making the incident worse during diagnosis.