Boeing has deployed a gradient-boosted classifier in Boeing AnalytX to predict whether incoming aircraft maintenance findings should be escalated for urgent engineering review. In offline validation the model performed well, but since deployment in the 737 fleet support workflow its production performance has deteriorated: more critical findings are being missed, and engineers report lower trust in the scores.

| Metric | Offline Test | Production (30 days) | Change |
|---|---|---|---|
| Precision | 0.84 | 0.68 | -0.16 |
| Recall | 0.81 | 0.57 | -0.24 |
| F1 Score | 0.82 | 0.62 | -0.20 |
| AUC-ROC | 0.91 | 0.79 | -0.12 |
| Log Loss | 0.29 | 0.51 | +0.22 |
| Escalation rate | 18.5% | 11.2% | -7.3 pts |
| Positive class rate | 16.9% | 22.4% | +5.5 pts |

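
Before interpreting the table, it may help to check whether its rows agree with one another. The sketch below assumes that "escalation rate" means the fraction of findings the model flags, that "positive class rate" means the fraction that truly needed escalation, and that every metric in a column was computed on the same set of findings; none of these is stated in the scenario. Under those assumptions, precision × escalation rate and recall × positive class rate both equal the true-positive fraction, so the escalation rate implied by the other three numbers can be compared with the reported one. The function and variable names are illustrative.

```python
def implied_escalation_rate(precision, recall, positive_rate):
    """Flagged fraction implied by: precision * flagged == recall * positive_rate."""
    return recall * positive_rate / precision


# Values copied from the metrics table, with percentages written as fractions.
columns = {
    "offline": dict(precision=0.84, recall=0.81, positive_rate=0.169, reported=0.185),
    "production": dict(precision=0.68, recall=0.57, positive_rate=0.224, reported=0.112),
}

results = {}
for name, m in columns.items():
    implied = implied_escalation_rate(m["precision"], m["recall"], m["positive_rate"])
    gap = m["reported"] - implied  # negative: fewer escalations reported than implied
    results[name] = (implied, gap)
    print(f"{name}: implied={implied:.3f} reported={m['reported']:.3f} gap={gap:+.3f}")
```

If the assumptions hold, the production column shows a much larger gap between reported and implied escalation rate than the offline column. That would be worth raising with whoever computed the production metrics, since it could mean the rows come from different populations, time windows, or label definitions.
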
The VP of Engineering wants to know why a model that passed testing is underperforming in production and what should be done before expanding to additional Boeing programs. You need to determine whether this is primarily a threshold issue, a calibration failure, data drift, a label mismatch, or a broader validation gap.
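
One way to separate these hypotheses is to ask how much a change in base rate alone could explain. The sketch below applies the standard prior-shift correction to a score, using the table's positive class rates (16.9% offline, 22.4% in production). It assumes the model's scores were calibrated probabilities offline and that only the class prior changed, not how findings of each class look to the model; the scenario confirms neither, and the function name is hypothetical.

```python
def adjust_for_prior_shift(p, train_rate, prod_rate):
    """Rescale a calibrated probability p from the training prior to a new prior.

    Assumes pure label shift: class-conditional feature distributions are
    unchanged and only the positive class rate differs.
    """
    pos = p * prod_rate / train_rate
    neg = (1.0 - p) * (1.0 - prod_rate) / (1.0 - train_rate)
    return pos / (pos + neg)


OFFLINE_RATE = 0.169  # positive class rate, offline test (from the table)
PROD_RATE = 0.224     # positive class rate, production (from the table)

for score in (0.2, 0.5, 0.8):
    adjusted = adjust_for_prior_shift(score, OFFLINE_RATE, PROD_RATE)
    print(f"offline score {score:.2f} -> prior-adjusted {adjusted:.3f}")
```

The correction is monotonic, so it preserves the ranking of findings and cannot change AUC-ROC. The same holds for moving the decision threshold. Since AUC-ROC fell from 0.91 to 0.79, neither a threshold change nor a base-rate change alone would account for the table, which points toward also investigating feature drift, label definitions, and how the offline validation set was constructed.
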