You own a gradient-boosted document classification model that routes uploaded files into one of two review queues on a creative workflow platform. The model was trained on historical enterprise documents and deployed with a 0.60 decision threshold after strong offline validation results. Product teams now want to extend the model to small-business (SMB) traffic and partner-ingested documents, but recent backtesting shows inconsistent performance across datasets even though overall accuracy still looks acceptable. You need to explain whether the model is actually ready for broader deployment and how you would evaluate it across these datasets.

| Metric | Internal Validation | New SMB Dataset | External Partner Dataset |
|---|---|---|---|
| Accuracy | 0.93 | 0.88 | 0.84 |
| Precision | 0.91 | 0.79 | 0.72 |
| Recall | 0.89 | 0.68 | 0.54 |
| F1 Score | 0.90 | 0.73 | 0.62 |
| Positive class rate | 42% | 27% | 18% |
| False positive rate | 0.05 | 0.09 | 0.12 |
| False negative rate | 0.11 | 0.32 | 0.46 |
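
Before interpreting the table, it can help to check that each column is internally consistent. The sketch below assumes every metric in a column was computed on the same sample at the same threshold, which the scenario does not state. Under that assumption, recall and false negative rate should sum to 1, and precision and accuracy can be re-derived from the positive class rate, recall, and false positive rate. The dictionary keys and function names are hypothetical, chosen only for illustration.

```python
from typing import Dict

# Values copied from the table above. Dataset keys are shorthand labels,
# and "prevalence" stands for the "Positive class rate" row.
METRICS: Dict[str, Dict[str, float]] = {
    "internal": {"accuracy": 0.93, "precision": 0.91, "recall": 0.89,
                 "prevalence": 0.42, "fpr": 0.05, "fnr": 0.11},
    "smb": {"accuracy": 0.88, "precision": 0.79, "recall": 0.68,
            "prevalence": 0.27, "fpr": 0.09, "fnr": 0.32},
    "partner": {"accuracy": 0.84, "precision": 0.72, "recall": 0.54,
                "prevalence": 0.18, "fpr": 0.12, "fnr": 0.46},
}


def implied_precision(prevalence: float, recall: float, fpr: float) -> float:
    """Precision implied by prevalence, recall (TPR), and FPR."""
    true_pos = prevalence * recall          # share of all documents that are TP
    false_pos = (1.0 - prevalence) * fpr    # share of all documents that are FP
    return true_pos / (true_pos + false_pos)


def implied_accuracy(prevalence: float, recall: float, fpr: float) -> float:
    """Accuracy implied by prevalence, recall (TPR), and FPR."""
    return prevalence * recall + (1.0 - prevalence) * (1.0 - fpr)


def consistency_report(metrics: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Compare reported precision and accuracy with the implied values."""
    report = {}
    for name, m in metrics.items():
        p_hat = implied_precision(m["prevalence"], m["recall"], m["fpr"])
        a_hat = implied_accuracy(m["prevalence"], m["recall"], m["fpr"])
        report[name] = {
            "recall_plus_fnr": m["recall"] + m["fnr"],  # should be 1.0
            "implied_precision": p_hat,
            "precision_gap": m["precision"] - p_hat,    # reported minus implied
            "implied_accuracy": a_hat,
            "accuracy_gap": m["accuracy"] - a_hat,      # reported minus implied
        }
    return report


if __name__ == "__main__":
    for dataset, row in consistency_report(METRICS).items():
        print(dataset, {k: round(v, 3) for k, v in row.items()})
```

A large gap between reported and implied values would suggest the rows came from different samples, thresholds, or averaging schemes. With the numbers above, the partner column implies a precision of roughly 0.50 against a reported 0.72, which is worth resolving before drawing conclusions about rollout readiness.
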

How would you evaluate this model across the three datasets, judge whether the degradation is acceptable, and what would you recommend happen before broader rollout?