Validate Generalization on New Data

Context

ShopEase built a binary classification model to predict whether a user will purchase within 7 days of viewing a product. The model is a gradient-boosted tree classifier used to prioritize remarketing campaigns. It performed well during development, but marketing is concerned that results on newly launched traffic sources are weaker than expected.

Current Performance

Metric	Cross-Validation (Train)	Holdout Test	New Production Data (Last 30 Days)
Accuracy	0.91	0.89	0.82
Precision	0.78	0.75	0.61
Recall	0.72	0.69	0.48
F1 Score	0.75	0.72	0.54
AUC-ROC	0.90	0.87	0.76
Positive Rate	0.24	0.23	0.19
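
As a quick sanity check on the table, the reported F1 scores can be recomputed from precision and recall, and the holdout-to-production gap can be expressed as a relative drop. The sketch below copies its numbers from the table; the helper names (f1_from, relative_drop) are illustrative and not part of any ShopEase code.

```python
# Values copied from the table above; dictionary keys are illustrative names.
metrics = {
    "cv":         {"precision": 0.78, "recall": 0.72, "f1": 0.75, "auc": 0.90},
    "holdout":    {"precision": 0.75, "recall": 0.69, "f1": 0.72, "auc": 0.87},
    "production": {"precision": 0.61, "recall": 0.48, "f1": 0.54, "auc": 0.76},
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_drop(before, after):
    """Fraction of the 'before' value that was lost."""
    return (before - after) / before


# Recompute F1 for each column and round to the table's two decimals.
f1_check = {
    name: round(f1_from(m["precision"], m["recall"]), 2)
    for name, m in metrics.items()
}

# Relative drop from holdout to production for each metric.
drops = {
    key: round(relative_drop(metrics["holdout"][key], metrics["production"][key]), 3)
    for key in ("precision", "recall", "f1", "auc")
}

print(f1_check)
print(drops)
```

The recomputed F1 values match the table after rounding, and recall shows the largest relative drop from holdout to production of the four metrics compared.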

The Problem

The model appears strong on validation and test data, but performance drops materially on recent production data. The team wants to know whether the model truly generalizes to unseen data, what the metric gaps imply, and how to validate this systematically before expanding campaign spend.

Requirements

Explain whether the model is generalizing well and justify your conclusion using the metrics above.
Identify the most likely reasons for the gap between holdout test and new production performance.
Describe how you would validate generalization beyond a single train/test split.
Recommend specific analyses to detect overfitting, data drift, or label distribution changes.
Propose practical steps to improve confidence before full rollout.
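
One common way to approach the drift analysis asked for above is the Population Stability Index (PSI), which compares the binned distribution of a feature in a reference sample against a recent sample. The problem statement does not prescribe this method; the sketch below is a minimal standard-library version, assuming equal-width bins taken from the reference sample and synthetic values in place of real ShopEase features.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index for one numeric feature.

    expected: reference sample (for example, training data)
    actual:   recent sample (for example, last 30 days of production)
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        total = len(values)
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / total, eps) for c in counts]

    e = bin_shares(expected)
    a = bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic illustration only: a reference sample and a shifted recent sample.
reference = [i / 100 for i in range(100)]
recent = [0.5 + i / 200 for i in range(100)]

print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, recent))     # shifted sample gives a positive value
```

Running the same function on the model's predicted scores, and separately per traffic source, would help show whether the shift is concentrated in the newly launched sources.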

Constraints

Marketing budget is fixed, so false positives waste spend.
Missing likely buyers reduces revenue from retargeting.
Retraining can only happen once every 2 weeks.
Labels arrive with a 7-day delay after prediction.
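
The last two constraints interact: at any moment the newest predictions cannot yet be scored, and a model in service may be up to one retraining interval old. The sketch below works through that arithmetic with the 7-day and 14-day figures from the constraints. The function names are hypothetical, and it assumes a label counts as available exactly 7 days after the prediction date.

```python
from datetime import date, timedelta

LABEL_DELAY = timedelta(days=7)     # from the constraints
RETRAIN_EVERY = timedelta(days=14)  # from the constraints


def latest_labeled_day(as_of):
    """Most recent prediction date whose outcome is known on 'as_of'."""
    return as_of - LABEL_DELAY


def worst_case_staleness():
    """Maximum age of the newest training label just before the next retrain.

    The model was trained up to 14 days ago, on labels for predictions
    made at least 7 days before that training run.
    """
    return RETRAIN_EVERY + LABEL_DELAY


today = date(2024, 3, 15)  # arbitrary example date
print(latest_labeled_day(today))     # 2024-03-08
print(worst_case_staleness().days)   # 21
```

Under these assumptions, label-based monitoring always lags by a week, so label-free checks on input features and score distributions are the only signals available for the most recent traffic.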

