Triage a Downed Production Server

Medium

Topics: Security & Infrastructure, Infrastructure, Networking, Troubleshooting

Outage Triage Scenario

Scenario

You are on call for a production server that hosts a critical internal service. It has just stopped responding to health checks. The node is still reachable through the management plane, but connections to the application port time out. A recent deploy, a kernel-level problem visible in the kernel logs, or a network change could each be involved. You have access to the host, orchestration logs, and infrastructure telemetry.

Question

What would you do to identify the root cause and restore service without making the outage worse? Walk me through how you would separate host, process, and network issues, and how you would decide when to fail over, restart, or escalate.
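As a way to picture the host / process / network separation the question asks about, here is a minimal Python sketch. It is only one possible framing, not a reference answer. The names `Probe`, `classify`, and `tcp_port_open` are hypothetical, and the four probe results are assumed to be gathered by hand or by tooling not shown here (for example, logging in through the management plane, checking that the service process exists, and connecting to the application port from the host itself and from a peer machine).

```python
import socket
from dataclasses import dataclass


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


@dataclass
class Probe:
    """Hypothetical probe results collected during triage."""
    host_reachable: bool    # management-plane access to the node works
    process_running: bool   # the service process exists on the host
    local_port_open: bool   # port accepts connections from the host itself
    remote_port_open: bool  # port accepts connections from another machine


def classify(p: Probe) -> str:
    """Name the first layer that fails, checking from the host outward."""
    if not p.host_reachable:
        return "host"
    if not p.process_running:
        return "process"
    if not p.local_port_open:
        # Process exists but is not accepting connections (hung, crashed
        # listener, wrong bind address): still a process-level problem.
        return "process"
    if not p.remote_port_open:
        # Works locally but not from peers: firewall, routing, or
        # load-balancer changes are the likely suspects.
        return "network"
    return "healthy"


if __name__ == "__main__":
    # The scenario above: node reachable, application port timing out remotely.
    print(classify(Probe(True, True, True, False)))
```

Checking from the host outward keeps each step read-only, so the diagnosis itself does not add risk. Any decision to restart, fail over, or escalate would then follow from which layer the classification points at.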

