ShopLens is building a binary classification model to predict whether a visitor will purchase within the current session so the marketing team can trigger a limited discount. A product manager is concerned because the model scored well on one test split, but performance varied when the data science team re-ran evaluation on different samples.

| Evaluation setup | Accuracy | Precision | Recall | F1 | AUC-ROC | Log Loss |
|---|---|---|---|---|---|---|
| Single 80/20 split (best run) | 0.84 | 0.71 | 0.58 | 0.64 | 0.81 | 0.46 |
| Single 80/20 split (worst run) | 0.76 | 0.55 | 0.49 | 0.52 | 0.69 | 0.61 |
| 5-fold cross-validation mean | 0.80 | 0.63 | 0.54 | 0.58 | 0.75 | 0.53 |
| 5-fold cross-validation std. dev. | 0.03 | 0.06 | 0.04 | 0.05 | 0.05 | 0.06 |
| Holdout set after model selection | 0.79 | 0.61 | 0.53 | 0.57 | 0.74 | 0.55 |
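
The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the AUC-ROC column (any other column would work the same way), the variable names are made up for this example, and "mean plus or minus one standard deviation" is a rough rule of thumb rather than a formal confidence interval.

```python
# Figures copied from the AUC-ROC column of the evaluation table.
best_split_auc = 0.81
worst_split_auc = 0.69
cv_mean_auc = 0.75
cv_std_auc = 0.05
holdout_auc = 0.74

# Spread between the luckiest and unluckiest single 80/20 split.
single_split_spread = round(best_split_auc - worst_split_auc, 2)

# How far the best single-split score overstates the holdout result.
optimism_gap = round(best_split_auc - holdout_auc, 2)

# Rough band (an assumption, not a formal interval): CV mean +/- one std. dev.
cv_low = round(cv_mean_auc - cv_std_auc, 2)
cv_high = round(cv_mean_auc + cv_std_auc, 2)
holdout_within_band = cv_low <= holdout_auc <= cv_high

print(f"Single-split spread: {single_split_spread}")
print(f"Best split minus holdout: {optimism_gap}")
print(f"CV band: {cv_low} to {cv_high}; holdout inside band: {holdout_within_band}")
```

Read this way, the best single split overstates the holdout AUC-ROC by 0.07, while the cross-validation mean of 0.75 lands within 0.01 of it.
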

The CMO asks: "Why do we need cross-validation instead of just reporting the best test score?" You need to explain the importance of cross-validation to a non-technical stakeholder and connect it to business risk.
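
For readers who want to see what 5-fold cross-validation does mechanically, here is a minimal sketch in plain Python. It is not the ShopLens team's pipeline: `k_fold_indices` is a hypothetical helper written for this illustration, the 100-row dataset size is arbitrary, and a real project would more likely rely on a library routine. The point it shows is that every record is used for testing exactly once, so no single lucky or unlucky split decides the reported score.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal samples round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training data is everything outside the current test fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 100 sessions, split into 5 folds.
splits = list(k_fold_indices(n_samples=100, k=5))
tested = sorted(idx for _, test_idx in splits for idx in test_idx)

for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: train on {len(train_idx)}, test on {len(test_idx)}")
```

A model would be trained and scored once per fold, and the five scores summarized as a mean and standard deviation, which is what the two cross-validation rows of the table report.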