Problem
A production server in Meta’s infrastructure starts malfunctioning during peak traffic. Symptoms include elevated request latency, intermittent 5xx errors, and missing heartbeats in fleet health monitoring. The issue may stem from hardware failure, kernel or OS corruption, network isolation, resource exhaustion, or a security incident.
As the on-call DevOps engineer, explain how you would respond from first alert to recovery.
What to cover
-
Immediate containment and safety
- How you would assess blast radius.
- When you would remove the host from service, drain traffic, or isolate it from the network.
- How you would preserve forensic evidence if compromise is possible.
-
Structured diagnosis
- The order in which you would check hardware signals, kernel logs, process health, filesystem state, network paths, and recent deploy or config changes.
- What signals from Meta-style observability systems, host metrics, and service dashboards you would use.
- How you would distinguish between a single-host issue and a broader rack, cluster, or dependency problem.
-
Recovery plan
- Short-term actions to restore service quickly.
- Criteria for rebooting, reprovisioning, failing over, or replacing hardware.
- How you would validate that the server is safe to return to production.
-
Communication and follow-up
- Who you would notify and when.
- What you would document during the incident.
- What post-incident actions you would drive to prevent recurrence.
Deliverable
Provide a concise incident-response playbook for this scenario. Use clear decision points, mention trade-offs between speed and evidence preservation, and assume the environment is large-scale, automated, and security-sensitive.