

You are reviewing a binary classifier that flags cases for manual review in a healthcare workflow. The team has computed four metrics on the same validation set and wants a clear comparison of what each one says about model quality.
How would you compare model performance using precision, recall, F1-score, and ROC-AUC?
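
For reference, here is a minimal sketch of how the four metrics relate, assuming labels are coded as 1 (flag for review) and 0 (no flag) and the model outputs a score per case. The function names `threshold_metrics` and `roc_auc`, the 0.5 default threshold, and the toy data are all illustrative assumptions, not the team's actual code or validation set. Precision, recall, and F1 are computed at a single decision threshold, while ROC-AUC is computed from the score ranking alone (here via the pairwise "positive outranks negative" formulation, with ties counted as half).

```python
from typing import Dict, Sequence


def threshold_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive class at one threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # O(n_pos * n_neg) pairwise count; fine for a sketch, slow for large sets.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not from the validation set in question).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

at_050 = threshold_metrics(y_true, y_score, threshold=0.5)
at_035 = threshold_metrics(y_true, y_score, threshold=0.35)
auc = roc_auc(y_true, y_score)

print(at_050)  # precision, recall, and F1 at the 0.5 threshold
print(at_035)  # lowering the threshold raises recall on this toy data
print(auc)     # unchanged by the threshold choice
```

On this toy data, lowering the threshold from 0.5 to 0.35 changes precision, recall, and F1, while ROC-AUC stays the same. That difference is worth keeping in mind when the four numbers are compared side by side.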