ShopSafe uses a binary classification model to detect fraudulent e-commerce orders before payment approval. The model performed well in offline validation, but after 8 weeks in production, operations reports rising fraud losses and more customer complaints about blocked legitimate orders.

| Metric | Offline Validation | Production (Last 30 Days) | Change |
|---|---|---|---|
| Accuracy | 0.94 | 0.91 | -0.03 |
| Precision | 0.78 | 0.61 | -0.17 |
| Recall | 0.74 | 0.52 | -0.22 |
| F1 Score | 0.76 | 0.56 | -0.20 |
| AUC-ROC | 0.89 | 0.81 | -0.08 |
| Fraud rate | 3.0% | 4.8% | +1.8 pts |
| Orders flagged/day | 1,150 | 1,420 | +270 |
| Monthly fraud loss | $180,000 | $310,000 | +$130,000 |
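
As a sanity check on the table, the sketch below recomputes F1 from the reported precision and recall, and compares the reported accuracy with a trivial baseline that approves every order (its accuracy is one minus the fraud rate). The function names are illustrative and not part of any ShopSafe codebase; the inputs are the values reported above. In both columns the reported accuracy sits below that approve-all baseline, which is one reason accuracy alone says little on data this imbalanced.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def approve_all_accuracy(fraud_rate: float) -> float:
    """Accuracy of a trivial baseline that never flags an order."""
    return 1.0 - fraud_rate


# Values copied from the table: (precision, recall, accuracy, fraud rate).
reported = {
    "offline": (0.78, 0.74, 0.94, 0.030),
    "production": (0.61, 0.52, 0.91, 0.048),
}

for name, (precision, recall, accuracy, fraud_rate) in reported.items():
    f1 = f1_score(precision, recall)
    baseline = approve_all_accuracy(fraud_rate)
    print(f"{name}: F1={f1:.2f}, accuracy={accuracy:.2f}, "
          f"approve-all baseline={baseline:.3f}")
# offline: F1=0.76, accuracy=0.94, approve-all baseline=0.970
# production: F1=0.56, accuracy=0.91, approve-all baseline=0.952
```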

The team needs to determine whether the model is still performing well in production and which signals to monitor beyond a single headline metric such as accuracy. Assess whether the current performance is acceptable, identify likely causes of the degradation, and recommend how to evaluate the model continuously in production.
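
One possible starting point for continuous evaluation is to recompute several metrics per time window once fraud labels are confirmed, and compare each against its offline baseline. The sketch below is a minimal illustration under assumptions: `window_metrics`, `degraded`, the toy window, and the 0.05 tolerance are all hypothetical and are not taken from ShopSafe's system.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    precision: float
    recall: float
    flag_rate: float   # share of orders the model flagged
    fraud_rate: float  # share of orders confirmed as fraud


def window_metrics(flagged: list[bool], is_fraud: list[bool]) -> WindowMetrics:
    """Compute monitoring metrics for one window of labeled orders."""
    if len(flagged) != len(is_fraud) or not flagged:
        raise ValueError("inputs must be non-empty and the same length")
    tp = sum(f and y for f, y in zip(flagged, is_fraud))
    fp = sum(f and not y for f, y in zip(flagged, is_fraud))
    fn = sum((not f) and y for f, y in zip(flagged, is_fraud))
    n = len(flagged)
    return WindowMetrics(
        precision=tp / (tp + fp) if tp + fp else 0.0,
        recall=tp / (tp + fn) if tp + fn else 0.0,
        flag_rate=(tp + fp) / n,
        fraud_rate=(tp + fn) / n,
    )


def degraded(current: float, baseline: float, max_drop: float) -> bool:
    """True when a metric fell more than max_drop below its baseline."""
    return (baseline - current) > max_drop


# Made-up toy window of 10 orders, for illustration only.
flagged = [True, True, True, False, False, False, False, False, False, False]
is_fraud = [True, False, True, True, False, False, False, False, False, False]
m = window_metrics(flagged, is_fraud)
print(f"precision={m.precision:.2f} recall={m.recall:.2f} "
      f"flag_rate={m.flag_rate:.2f} fraud_rate={m.fraud_rate:.2f}")

# Reported table values checked against a hypothetical 0.05 tolerance.
print(degraded(current=0.52, baseline=0.74, max_drop=0.05))  # recall: True
print(degraded(current=0.91, baseline=0.94, max_drop=0.05))  # accuracy: False
```

With this hypothetical tolerance, the reported recall drop would raise an alert while the reported accuracy drop would not, which illustrates why tracking precision, recall, flag rate, and fraud rate separately can surface degradation that accuracy hides.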