You have trained a model and now need to compare its performance across multiple datasets: training, validation, test, and a newly collected holdout set. The team wants to know whether the model is truly generalizing or whether the performance differences stem from dataset shift, class imbalance, or inconsistent score calibration.
How do you evaluate the performance of a machine learning model across different datasets?
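One possible starting point is to compute the same small set of metrics on every split and read them side by side. The following is a minimal sketch, not a definitive procedure. It assumes a binary classifier that outputs probabilities. The function names (`roc_auc`, `brier_score`, `evaluate_splits`) and the toy labels and scores are hypothetical, invented only for illustration. The positive rate exposes class imbalance between splits, ROC AUC measures ranking quality independently of any threshold, and the Brier score together with the gap between the mean predicted score and the positive rate gives a rough view of calibration.

```python
from typing import Dict, Sequence, Tuple


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Rank-based ROC AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows, which is fine
    for a sketch but too slow for large datasets.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)


def evaluate_splits(
    splits: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> Dict[str, Dict[str, float]]:
    """Compute identical metrics for every named split so they are comparable."""
    report = {}
    for name, (y_true, y_score) in splits.items():
        positive_rate = sum(y_true) / len(y_true)
        mean_score = sum(y_score) / len(y_score)
        report[name] = {
            "n": len(y_true),
            "positive_rate": positive_rate,  # differs across splits under class imbalance
            "roc_auc": roc_auc(y_true, y_score),  # threshold-free ranking quality
            "brier": brier_score(y_true, y_score),  # lower is better
            "calibration_gap": mean_score - positive_rate,  # > 0 means over-prediction
        }
    return report


# Toy data (made up): labels and predicted probabilities per split.
splits = {
    "train": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
    "holdout": ([0, 0, 0, 1], [0.3, 0.6, 0.4, 0.5]),
}
report = evaluate_splits(splits)

for name, metrics in report.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

When reading such a report, a drop in ROC AUC from train to holdout suggests weaker generalization or shift in the inputs, while a stable ROC AUC with a growing calibration gap points more toward a change in the base rate or miscalibrated scores. With real data, each metric would also need a confidence interval (for example from bootstrapping) before a difference between splits is treated as meaningful.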