You have a classification model that outputs probabilities, and stakeholders want to use those scores for decision-making rather than just ranking. You need to judge whether the predicted probabilities can be trusted at face value: for example, whether roughly 80% of the cases scored at 0.8 actually turn out to be positive.
How would you determine whether a model is well calibrated?
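
One way to make the "0.8 should mean about 80%" check concrete is a binned reliability table: group predictions into probability bins, then compare each bin's mean predicted probability with its observed positive rate. The sketch below is a minimal illustration, not a definitive method. The function names `reliability_table` and `expected_calibration_error` are hypothetical, the choice of 10 equal-width bins is an assumption, and the data is synthetic, generated so that it is calibrated by construction.

```python
import random


def reliability_table(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin.

    Assumes binary labels in {0, 1} and probabilities in [0, 1].
    Bins are equal-width; this is an illustrative choice.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))

    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than report a rate for them
        count = len(members)
        mean_pred = sum(p for _, p in members) / count
        observed = sum(y for y, _ in members) / count
        rows.append((mean_pred, observed, count))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of |mean_predicted - observed_rate| over bins."""
    total = sum(count for _, _, count in rows)
    return sum(count / total * abs(mean_pred - observed)
               for mean_pred, observed, count in rows)


# Synthetic data (made up for illustration): each label is drawn with
# probability equal to its score, so the scores are calibrated by design.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in probs]

rows = reliability_table(labels, probs)
ece = expected_calibration_error(rows)

print("mean_pred  observed  count")
for mean_pred, observed, count in rows:
    print(f"{mean_pred:9.2f}  {observed:8.2f}  {count:5d}")
print(f"ECE: {ece:.3f}")
```

For a well-calibrated model the two columns track each other closely in every bin and the summary error is near zero. Large gaps in specific bins show where the scores are over- or under-confident. With few samples per bin the observed rates are noisy, so the bin counts are worth reading alongside the rates.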