You are the DS owner of a gradient-boosted classifier that predicts whether a patient will discontinue a chronic-care treatment within 90 days, so outreach teams can intervene early. The model scores patients weekly, and anyone above a 0.60 threshold is sent to a limited nurse follow-up queue. In offline development, the model looked strong enough to replace a simpler logistic regression baseline, but after review, stakeholders noticed the gains were concentrated on training data and much weaker on unseen patients. You are asked whether the model is overfitting and whether it is safe to launch.
| Metric | Training Set | Validation Set | Holdout Test Set |
|---|---|---|---|
| Accuracy | 0.91 | 0.79 | 0.77 |
| Precision | 0.88 | 0.70 | 0.68 |
| Recall | 0.86 | 0.62 | 0.59 |
| F1 Score | 0.87 | 0.66 | 0.63 |
| AUC-ROC | 0.95 | 0.81 | 0.79 |
| Log Loss | 0.21 | 0.49 | 0.53 |
| Positive prediction rate | 18% | 14% | 13% |
How would you assess from these results whether the model is overfitting, and what would you recommend before deployment?