
You have trained a model and offline results look strong on the data used during development. Before relying on it, you need a clear way to judge whether that performance is likely to hold on truly unseen data.
How would you validate that a model will generalize well to unseen data?
Performance on untouched holdout data
Stability across cross-validation folds
Train versus validation gap for bias-variance diagnosis
Calibration of predicted probabilities
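
The four checks listed above can be sketched in plain Python. This is a minimal illustration and not a prescribed procedure: the function names (`holdout_split`, `fold_stability`, `generalization_gap`, `expected_calibration_error`), the 20% holdout fraction, and the 10 equal-width bins are all assumptions chosen for the example, and the scores fed in are assumed to come from a higher-is-better metric such as accuracy.

```python
import random
from statistics import mean, stdev


def holdout_split(n, holdout_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and set aside a holdout slice.

    The holdout indices should be touched only once, for the final check.
    The 0.2 default is an illustrative assumption.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * holdout_frac))
    return idx[cut:], idx[:cut]  # (development indices, holdout indices)


def fold_stability(fold_scores):
    """Return the mean and sample standard deviation of per-fold scores.

    A large spread relative to the mean suggests the estimate is unstable.
    """
    return mean(fold_scores), stdev(fold_scores)


def generalization_gap(train_score, val_score):
    """Train minus validation score for a higher-is-better metric.

    A large positive gap points toward overfitting (variance); low scores
    on both sides with a small gap point toward underfitting (bias).
    """
    return train_score - val_score


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate.

    Predictions are grouped into equal-width probability bins; within each
    bin the mean predicted probability is compared with the fraction of
    positive labels. The bin count is an illustrative assumption.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, y))

    total = len(probs)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_prob = mean(p for p, _ in items)
        positive_rate = mean(y for _, y in items)
        ece += len(items) / total * abs(mean_prob - positive_rate)
    return ece
```

In use, one would fit the model on the development indices only, collect one score per cross-validation fold for `fold_stability`, compare training and validation scores with `generalization_gap`, and pass the holdout predictions to `expected_calibration_error`. What counts as an acceptable spread, gap, or calibration error depends on the task and is not something this sketch decides.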