You have a classifier that outputs probabilities, and stakeholders want to know whether those scores can be trusted for decision-making. The model may rank examples well, but good ranking does not imply good calibration: the question is whether, among examples scored near 0.8, the event actually occurs about 80% of the time.
How would you assess whether a model is well-calibrated?
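
One common starting point is a binned reliability check: group held-out predictions by score, then compare the mean predicted probability in each bin with the observed event rate. The sketch below is only illustrative. The function name `reliability_table`, the choice of 10 equal-width bins, and the synthetic data are all assumptions made for the example, not part of the scenario above. It also reports the expected calibration error (ECE), the count-weighted average gap between predicted and observed rates.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Compare mean predicted probability with observed event rate per score bin.

    Hypothetical helper for illustration; uses equal-width bins on [0, 1].
    Returns (rows, ece), where each row is (bin_index, count, mean_pred, obs_rate).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1
    # and a score of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # skip empty bins
        mean_pred = y_prob[mask].mean()
        obs_rate = y_true[mask].mean()
        # Weight each bin's gap by its share of the examples.
        ece += mask.mean() * abs(mean_pred - obs_rate)
        rows.append((b, int(mask.sum()), mean_pred, obs_rate))
    return rows, ece

# Synthetic data (an assumption for the demo): p is the true event probability.
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=20_000)
y = (rng.uniform(size=p.size) < p).astype(int)

# A calibrated scorer reports p itself; an overconfident one pushes scores
# toward 0 and 1 while preserving the ranking.
overconfident = p**2 / (p**2 + (1.0 - p) ** 2)

rows_cal, ece_cal = reliability_table(y, p)
rows_over, ece_over = reliability_table(y, overconfident)

print(f"ECE, calibrated scores:    {ece_cal:.3f}")
print(f"ECE, overconfident scores: {ece_over:.3f}")
for b, n, mean_pred, obs_rate in rows_over:
    print(f"bin {b}: n={n:5d}  mean predicted={mean_pred:.2f}  observed={obs_rate:.2f}")
```

Both score sets rank the examples identically, so a ranking metric such as AUC would not distinguish them; only the per-bin comparison exposes the gap. In practice the check should be run on held-out data, and the bin count is a judgment call: too few bins hide miscalibration, too many leave each estimate noisy.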