ShopLens built a binary classification model to predict whether a customer will make a repeat purchase within 30 days of their first order. A LightGBM model looked excellent in offline validation, but performance dropped sharply after deployment. The team suspects data leakage introduced during feature engineering.

| Metric | Offline Validation | Production Holdout | Change |
|---|---|---|---|
| Accuracy | 0.91 | 0.74 | -0.17 |
| Precision | 0.88 | 0.63 | -0.25 |
| Recall | 0.86 | 0.52 | -0.34 |
| F1 Score | 0.87 | 0.57 | -0.30 |
| AUC-ROC | 0.95 | 0.69 | -0.26 |
| Log Loss | 0.21 | 0.61 | +0.40 |

The feature set includes customer tenure, average basket size, email engagement, support contacts, and rolling 30-day order aggregates. During review, the team found that some aggregates may have been computed over windows that extend past the prediction timestamp, which would let them include orders placed after the moment the prediction is made.
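
To make the suspected failure mode concrete, here is a minimal sketch of a rolling 30-day order count computed two ways: as of the prediction timestamp, and as of a later snapshot date. The function name `count_orders_in_window`, the `as_of` parameter, and the sample timestamps are all hypothetical. ShopLens's actual pipeline is not shown in the source, so this illustrates the pattern rather than reproducing their code.

```python
from datetime import datetime, timedelta


def count_orders_in_window(order_times, as_of, window_days=30):
    """Count orders in the half-open window [as_of - window_days, as_of).

    Orders at or after `as_of` are excluded, so the feature reflects only
    what was known at that moment.
    """
    window_start = as_of - timedelta(days=window_days)
    return sum(1 for ts in order_times if window_start <= ts < as_of)


# Hypothetical customer: first order on 1 Jan, repeat order on 15 Jan.
order_times = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 15, 9, 0)]

prediction_ts = datetime(2024, 1, 2)   # assumed scoring time for this customer
snapshot_ts = datetime(2024, 1, 31)    # assumed date the training table was built

point_in_time_count = count_orders_in_window(order_times, as_of=prediction_ts)
leaky_count = count_orders_in_window(order_times, as_of=snapshot_ts)

print(point_in_time_count)  # 1: only the first order is visible
print(leaky_count)          # 2: the repeat purchase (the label event) leaks in
```

If the training table anchors its windows to a snapshot date while production anchors them to the scoring time, the offline feature can encode the label directly. That would be consistent with a large offline-to-production gap, though it does not by itself prove leakage is the cause.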
You need to determine whether the performance gap is caused by leakage, identify which features or validation steps are most suspicious, and recommend how to redesign the feature engineering and evaluation process.
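
One possible redesign of the evaluation step is a time-based split that leaves a gap equal to the label horizon, so that every training label is fully observed before the validation period starts. The sketch below assumes each row carries a `prediction_ts` field and that the label horizon is the 30 days named in the task. The function name `temporal_split`, the row layout, and the cutoff date are illustrative assumptions, not part of the source.

```python
from datetime import datetime, timedelta


def temporal_split(rows, cutoff, label_horizon_days=30):
    """Split rows into (train, valid) by prediction time.

    train: rows whose label window closes on or before `cutoff`
    valid: rows scored on or after `cutoff`
    Rows whose label window straddles the cutoff are dropped.
    """
    horizon = timedelta(days=label_horizon_days)
    train = [r for r in rows if r["prediction_ts"] + horizon <= cutoff]
    valid = [r for r in rows if r["prediction_ts"] >= cutoff]
    return train, valid


# Hypothetical rows; only the timestamp matters for the split.
rows = [
    {"customer_id": "a", "prediction_ts": datetime(2024, 1, 5)},
    {"customer_id": "b", "prediction_ts": datetime(2024, 2, 20)},
    {"customer_id": "c", "prediction_ts": datetime(2024, 3, 10)},
]
train, valid = temporal_split(rows, cutoff=datetime(2024, 3, 1))

print([r["customer_id"] for r in train])  # ['a']
print([r["customer_id"] for r in valid])  # ['c']
```

Customer `b` is dropped because its 30-day label window crosses the cutoff. Re-running offline validation with a split like this, on features rebuilt as of each row's prediction timestamp, is one way to check whether the offline metrics fall toward the production numbers.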