You've shipped a model that looked strong in offline training and validation, but it is underperforming in production. The team wants you to figure out why the offline results did not carry over.
How do you debug a model that looks good offline but fails in production?