"Tell me about a time you made a mistake that caused a production outage. How did you handle it? In your answer, walk me through what happened, how you responded in the moment, how you communicated during the incident, and what you changed afterward to reduce the chance of it happening again."
For a DevOps Engineer at Meta, this question tests whether you show real ownership when reliability is on the line. Interviewers want to understand how you behave under pressure: whether you can stabilize a system quickly, communicate clearly across engineering and partner teams, and avoid becoming defensive when your own action contributed to the incident. In a high-scale environment involving surfaces like TAO-backed services, Configerator-managed changes, or production rollouts, mistakes happen; what matters is judgment, transparency, and learning velocity.
A strong answer is specific: name the change, the blast radius, the timeline, and the operational impact. The best responses show calm incident leadership, fast prioritization of mitigation over ego, clear stakeholder updates, a credible postmortem, and one or more concrete preventive fixes with measurable results.