ShopLens runs a purchase propensity model that scores 4.2M users daily for email and in-app promotions. Over the last quarter, offline validation remained stable, but online campaign performance became inconsistent after multiple untracked data refreshes and model replacements.
The team currently stores model artifacts in object storage with ad hoc names and overwrites feature tables in place. Leadership wants a production-ready versioning strategy for both data and models that preserves reproducibility, supports rollback, and makes metric changes explainable.
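
One way to address the ad hoc names and in-place overwrites is to name each artifact after a hash of its contents and refuse to rewrite an existing key. The sketch below is illustrative only: it uses a local directory as a stand-in for object storage, and `content_id`, `write_once`, and the `kind` prefix are hypothetical names, not part of ShopLens's current setup.

```python
import hashlib
from pathlib import Path


def content_id(payload: bytes) -> str:
    """Return a short, deterministic ID derived from the bytes themselves."""
    return hashlib.sha256(payload).hexdigest()[:16]


def write_once(root: Path, kind: str, payload: bytes) -> Path:
    """Store payload under <root>/<kind>/<content_id> and never overwrite it.

    `root` stands in for an object-storage bucket (an assumption for this
    sketch); `kind` is a hypothetical prefix such as "features" or "models".
    """
    target = root / kind / content_id(payload)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # Identical bytes always map to the same key, so skipping is safe.
        target.write_bytes(payload)
    return target
```

Because the key depends only on the bytes, a refreshed feature table lands under a new key and the earlier snapshot stays available for rollback and for re-running old evaluations.
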

| Metric | Model v17 + Data snapshot 2024-05 | Model v18 + overwritten data 2024-06 | Change |
|---|---|---|---|
| AUC-ROC | 0.81 | 0.80 | -0.01 |
| Log Loss | 0.46 | 0.51 | +0.05 |
| Precision @ top 10% | 0.29 | 0.24 | -0.05 |
| Recall @ top 10% | 0.41 | 0.36 | -0.05 |
| Calibration error | 0.03 | 0.09 | +0.06 |
| Campaign conversion rate | 3.8% | 3.1% | -0.7 pp |

The team cannot determine whether the drop came from model changes, training data changes, feature definition drift, or threshold updates. There is no reliable mapping between a prediction in production and the exact training dataset, feature code, hyperparameters, or evaluation report used to create it.
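
The missing mapping could be captured as a small lineage record written at training time and referenced by every scored prediction. The following is a minimal sketch under stated assumptions: `LineageRecord`, its field names, `diff_lineage`, and `tag_prediction` are hypothetical, and the identifiers in the usage note are placeholders rather than real ShopLens values.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class LineageRecord:
    """Everything needed to trace a prediction back to how it was produced."""
    model_version: str
    data_snapshot_id: str        # immutable training-data snapshot
    feature_code_rev: str        # revision of the feature definitions
    hyperparameters: Dict[str, Any]
    eval_report_id: str
    decision_threshold: float

    def lineage_id(self) -> str:
        # Canonical JSON (sorted keys) so equal records always hash the same.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def diff_lineage(old: LineageRecord, new: LineageRecord) -> Dict[str, Tuple[Any, Any]]:
    """Return only the fields that changed, as (old, new) pairs."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def tag_prediction(user_id: str, score: float, record: LineageRecord) -> Dict[str, Any]:
    """Attach the lineage ID to a scored row before it is written out."""
    return {"user_id": user_id, "score": score, "lineage_id": record.lineage_id()}
```

As a usage illustration with placeholder values, comparing two records shows which inputs moved between releases:

```python
v17 = LineageRecord("v17", "snap-A", "rev-1", {"max_depth": 6}, "eval-17", 0.5)
v18 = LineageRecord("v18", "snap-B", "rev-1", {"max_depth": 6}, "eval-18", 0.4)
changed = diff_lineage(v17, v18)
# `changed` lists model_version, data_snapshot_id, eval_report_id and
# decision_threshold; feature_code_rev and hyperparameters are absent
# because they did not change.
```

With records like these, a metric change can be narrowed down by re-evaluating while varying one field at a time, for example the old model on the new snapshot, instead of comparing two releases that differ in several ways at once.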