Problem
A production web server behind Meta’s load balancing layer is returning intermittent HTTP 500 responses for a critical internal service. Walk through how you would debug this issue end to end.
Your answer should focus on systematic triage under production constraints, not just listing possible causes. Assume you have access to standard host-level tools, service logs, metrics dashboards, deployment history, and Meta-style observability surfaces.
What to cover
-
Scope and blast radius
- How you determine whether the issue affects one host, one region, one service version, or all traffic.
- What signals you would check first: error rate, latency, saturation, recent deploys, config pushes, certificate rotation, dependency health.
-
Request path isolation
- How you separate failures at the load balancer, reverse proxy, application process, local OS, and downstream dependencies such as cache, database, or auth service.
- How you would use logs, health checks, tracing, and synthetic requests to localize the fault.
-
Host and process debugging
- What you inspect on the affected machine: CPU, memory, file descriptors, disk, kernel errors, process crashes, OOM kills, thread exhaustion, TLS issues, permission changes.
- Which commands or artifacts you would use and why.
-
Security and incident handling
- When a 500 pattern should trigger security suspicion, such as malformed-request spikes, WAF bypass attempts, SSRF probes, credential or secret failures, or unexpected binary/config drift.
- How you preserve evidence while restoring service.
-
Mitigation and follow-up
- Immediate containment options: drain host, rollback deploy, disable bad config, fail over traffic, rate-limit abusive sources.
- Long-term fixes: better alerting, canaries, dependency circuit breakers, runbooks, and postmortem actions.
Use a concrete debugging sequence and explain the decision points that help you narrow the problem quickly.