ShopLens built a binary classification model to predict whether a customer will click a recommended product in the app. The team has been using a logistic regression baseline and is testing a new gradient boosting model before rollout. Leadership sees slightly better top-line metrics from the new model, but the product team is concerned about whether the improvement is meaningful and whether any tradeoffs are hidden.
| Metric | Baseline Model | New Model | Change |
|---|---|---|---|
| Accuracy | 0.842 | 0.861 | +0.019 |
| Precision | 0.610 | 0.680 | +0.070 |
| Recall | 0.550 | 0.470 | -0.080 |
| F1 Score | 0.578 | 0.554 | -0.024 |
| AUC-ROC | 0.781 | 0.824 | +0.043 |
| Log Loss (lower is better) | 0.462 | 0.418 | -0.044 |
| Positive rate in data | 0.180 | 0.180 | 0.000 |
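
To make the precision and recall tradeoff concrete, the table's rates can be converted into approximate counts per 1,000 recommendations. The sketch below is illustrative only: the function names and the batch size of 1,000 are assumptions, not part of the ShopLens setup. The tabulated metrics are rounded, so the implied counts are approximations and need not reconcile exactly with the reported accuracy. The last line computes the accuracy of a trivial model that always predicts "no click", which is a useful reference point when only 18% of examples are positive.

```python
def implied_counts(precision, recall, positive_rate, n=1000):
    """Approximate confusion-matrix counts implied by summary metrics.

    n is a hypothetical batch size chosen for readability.
    """
    positives = positive_rate * n        # actual clicks in the batch
    tp = recall * positives              # clicks the model flags
    fn = positives - tp                  # clicks the model misses
    predicted_pos = tp / precision       # everything the model flags
    fp = predicted_pos - tp              # flagged items that are not clicked
    tn = n - positives - fp              # correctly ignored non-clicks
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and positive rate are taken from the table above.
baseline = implied_counts(precision=0.610, recall=0.550, positive_rate=0.180)
new = implied_counts(precision=0.680, recall=0.470, positive_rate=0.180)

# Accuracy of always predicting the majority class ("no click").
majority_accuracy = 1 - 0.180

for name, counts in (("baseline", baseline), ("new", new)):
    rounded = {key: round(value, 1) for key, value in counts.items()}
    print(name, rounded)
print("majority-class accuracy:", round(majority_accuracy, 3))
```

Under these assumptions, the new model catches roughly 85 of the 180 clicks per 1,000 recommendations where the baseline catches 99, while raising about 40 false alarms where the baseline raises about 63. Both models' reported accuracy should also be read against the 0.82 that a model predicting "no click" for every example would achieve.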

The new model improves accuracy, precision, AUC-ROC, and log loss, but its recall falls from 0.550 to 0.470, so it identifies fewer of the actual clicks, and its F1 score drops as a result. You need to explain the difference between the baseline and the new model in a practical evaluation setting and determine whether the new model should replace the baseline.
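
One way to approach this question is to check whether the recall drop reflects the decision threshold rather than the model's ranking quality. Precision and recall depend on the threshold, while AUC-ROC does not. A minimal sketch of a matched-recall comparison follows. The labels and scores are a made-up toy example, not ShopLens data, and the helper names are hypothetical.

```python
def precision_recall_at_threshold(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    total_pos = sum(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, tp / total_pos


def precision_at_recall(labels, scores, target_recall):
    """Return (threshold, precision, recall) at the highest threshold
    whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Skip positions inside a run of tied scores; a threshold cannot split them.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        recall = tp / total_pos
        if recall >= target_recall:
            return score, tp / k, recall
    raise ValueError("target_recall must be at most 1.0")


# Toy data (hypothetical): 4 clicks out of 10 recommendations.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
baseline_scores = [0.9, 0.6, 0.4, 0.3, 0.7, 0.5, 0.35, 0.2, 0.1, 0.05]
new_scores = [0.8, 0.45, 0.42, 0.2, 0.44, 0.3, 0.15, 0.1, 0.05, 0.02]

# Same fixed threshold for both models.
print(precision_recall_at_threshold(labels, baseline_scores, 0.5))
print(precision_recall_at_threshold(labels, new_scores, 0.5))

# Same recall target for both models, each with its own threshold.
print(precision_at_recall(labels, baseline_scores, 0.75))
print(precision_at_recall(labels, new_scores, 0.75))
```

In this toy data the new model has lower recall than the baseline at a shared 0.5 threshold, yet higher precision once each model's threshold is tuned to reach the same recall. Whether the same holds for the real models can only be established by running this kind of sweep on held-out ShopLens data. The choice of operating point should then reflect the relative cost of a missed click and a wasted recommendation slot.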