


You are reviewing a binary classification model, and the team is debating which metric should drive the final decision. The model scores each case, and a threshold turns that score into an action. Different stakeholders care about different kinds of mistakes (false positives versus false negatives), so the same model can look strong on one metric and weak on another.
How do you decide whether to use precision, recall, F1 score, or ROC-AUC?
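
To make the scenario concrete, here is a minimal Python sketch of how one set of model scores can give very different precision, recall, and F1 depending on the threshold, while ROC-AUC stays fixed because it does not depend on a threshold. The labels, scores, and thresholds are invented toy values, and the function names are hypothetical helpers written for this illustration, not part of any library.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when score >= threshold triggers the action."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, scores, threshold):
    """Threshold-dependent metrics; each returns 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Threshold-free metric: the fraction of (positive, negative) pairs
    in which the positive has the higher score, with ties counted as half.
    Assumes at least one positive and one negative label."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

strict = precision_recall_f1(y_true, scores, threshold=0.75)
lenient = precision_recall_f1(y_true, scores, threshold=0.25)
auc = roc_auc(y_true, scores)

print("threshold 0.75: precision=%.3f recall=%.3f f1=%.3f" % strict)
print("threshold 0.25: precision=%.3f recall=%.3f f1=%.3f" % lenient)
print("ROC-AUC (no threshold): %.3f" % auc)
```

On this toy data the strict threshold gives perfect precision but catches only half of the positives, while the lenient threshold catches every positive at the cost of three false alarms. ROC-AUC is the same in both cases, which is why it says something about how well the model ranks cases but nothing about which threshold, and therefore which trade-off between mistakes, the team should choose.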