You have trained a model that looks promising offline, and your team wants to ship it into a live decision flow. Before launch, you need a validation plan that goes beyond a single holdout score and shows whether the model is reliable enough for production use.
How would you rigorously validate this model before it goes to production?