"Tell me about a machine learning model you shipped that failed in production. What happened, how did you respond, and what did you do afterward to prevent it from happening again? If helpful, you can ground your answer in a Meta-relevant surface such as Feed, Reels, Ads ranking, Integrity, or Notifications."
This question tests whether you take real ownership when an ML system breaks in the messy reality of production, where offline metrics often diverge from online outcomes. I’m looking for evidence of how you handled ambiguity, triaged impact, communicated with cross-functional partners, and balanced immediate mitigation against long-term fixes. At Meta, strong ML engineers are expected not just to build models, but to operate them responsibly at scale.
A weak answer treats the failure as bad luck, blames data or infrastructure teams, or focuses only on technical debugging. A strong answer shows judgment under pressure: how you detected the issue, how you prioritized user or business risk, how you influenced others without formal authority, and how you improved the system, process, or team afterward.
Use a specific example with clear stakes, a timeline, and measurable impact. Structure it using STAR (Situation, Task, Action, Result), and make sure the "Action" section covers both the incident response and the durable changes you drove after the failure.