You have an offline model that looks strong in validation, but the team is asking whether it actually works in the field. The same score threshold is being used in production, and stakeholders want evidence that the model's decisions hold up once real users, real delays, and real labels are involved.
How do you validate a model's real-world performance beyond offline metrics?