You are on call for a production server that hosts a critical internal service and has just stopped responding to health checks. The node is still reachable through the management plane, but the application port is timing out and recent deploys, kernel logs, and network changes may all be involved. You have access to the host, orchestration logs, and infrastructure telemetry.
What would you do to identify the root cause and restore service without making the outage worse? Walk me through how you would separate host, process, and network issues, and how you would decide when to fail over, restart, or escalate.