
You are responding to a severe production outage affecting a critical customer-facing system. Service is degraded, leadership wants frequent updates, and multiple teams are involved in diagnosis and recovery. After the incident is stabilized, you need to explain how you determined the root cause and what process you used to decide between mitigation, rollback, and deeper investigation.
Tell me about a time you had to troubleshoot a severe production outage. What was your RCA process?