
You have trained a model and now need to assess whether its performance is consistent across multiple datasets, such as training, validation, test, or data from different sources. The team wants a clear evaluation approach that goes beyond looking at a single metric on one split.
How do you evaluate the performance of a machine learning model across different datasets?
Training set from the historical distributionValidation split used for model selectionTime-based holdout representing recent trafficNew segment dataset with different feature and label balance