At ModelOps Cloud, a team maintains an automated evaluation pipeline for a binary classification model that routes customer support tickets into urgent and non-urgent queues. After a recent model update, leadership noticed that the quick deployment check passed, yet the full validation suite later showed a meaningful performance degradation on production-like data.
This question applies the software testing concepts of smoke testing and regression testing to a model evaluation setting. You need to explain the difference between the two checks and interpret what the current results imply for model release quality.
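For reference, a minimal sketch of how the two checks are commonly structured in an evaluation pipeline is shown below. The helper names (`evaluate`, `smoke_test`, `regression_test`), the smoke thresholds, and the allowed drop are illustrative assumptions rather than the team's actual pipeline; the baseline numbers are taken from the previous production model's row in the table that follows.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four metrics reported in the results table."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Smoke test: small, recent sample; coarse sanity thresholds; runs in seconds.
SMOKE_THRESHOLDS = {"accuracy": 0.85, "f1": 0.80}  # illustrative floors

def smoke_test(y_true, y_pred):
    metrics = evaluate(y_true, y_pred)
    return all(metrics[name] >= floor for name, floor in SMOKE_THRESHOLDS.items())

# Regression test: full benchmark; every metric compared against the
# previous production model, with a small tolerated drop.
BASELINE_METRICS = {"accuracy": 0.88, "precision": 0.87, "recall": 0.79, "f1": 0.83}
ALLOWED_DROP = 0.02  # illustrative tolerance

def regression_test(y_true, y_pred):
    metrics = evaluate(y_true, y_pred)
    return all(metrics[name] >= baseline - ALLOWED_DROP
               for name, baseline in BASELINE_METRICS.items())
```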
| Check | Scope | Result | Key Metrics (urgent = positive class) |
|---|---|---|---|
| Smoke test | 500 recent samples | Pass | Accuracy: 0.91, Precision: 0.89, Recall: 0.88, F1: 0.88 |
| Regression test | 20,000 benchmark samples | Fail | Accuracy: 0.86, Precision: 0.92, Recall: 0.61, F1: 0.73 |
| Previous production model | 20,000 benchmark samples | Pass | Accuracy: 0.88, Precision: 0.87, Recall: 0.79, F1: 0.83 |
| New model on high-priority tickets | 4,000 samples | Warning | Precision: 0.95, Recall: 0.54 |
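A quick consistency check on the table: with F1 = 2PR / (P + R), the regression-test row gives 2 × 0.92 × 0.61 / (0.92 + 0.61) ≈ 0.73, matching the reported value. The drop from the previous model's F1 of 0.83 is therefore driven almost entirely by recall falling from 0.79 to 0.61; precision actually improved. On high-priority tickets the same formula yields roughly 0.69, so the imbalance is even sharper on the segment that matters most.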
The deployment pipeline allowed the model through an initial health check, but the broader benchmark indicates the new version misses too many urgent tickets. Product and operations teams want to know whether this is a testing design issue, a threshold issue, or a true model quality regression.
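If "threshold issue" refers to the model's classification decision threshold, one way to probe that hypothesis is to sweep the threshold over the benchmark scores and see whether any operating point recovers recall at a precision the operations team can accept. A minimal sketch, assuming the new model exposes probability scores (`y_scores`) and that a 0.85 precision floor is acceptable (both are assumptions, not figures from the scenario):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_recall_at_precision(y_true, y_scores, min_precision=0.85):
    """Sweep the decision threshold and return the operating point with the
    highest recall among those whose precision stays at or above min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # The last precision/recall pair has no matching threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    acceptable = precision >= min_precision
    if not acceptable.any():
        return None  # no threshold meets the precision floor
    idx = np.argmax(np.where(acceptable, recall, -1.0))
    return {"threshold": thresholds[idx],
            "precision": precision[idx],
            "recall": recall[idx]}
```

If the best recoverable recall is close to the previous model's 0.79, the degradation is largely a threshold or calibration problem; if recall stays near 0.61 regardless of threshold, the new model genuinely ranks urgent tickets worse, and the regression test is flagging a real quality regression that the small, coarse smoke test was never designed to catch.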