Your team has trained a model, and offline results look promising. Before shipping it, you need to decide whether the model is good enough for production and what evidence would justify deployment.
How would you evaluate whether a model is suitable for deployment?