OpenAI is training a click-through-rate model for in-product recommendation slots shown across ChatGPT surfaces. A recent training run on the latest week of data is diverging: training loss becomes NaN after a few hundred steps, validation AUC drops below the historical baseline, and gradient norms spike to roughly 100x those of prior runs.
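Gradient-norm spikes of this magnitude usually warrant clipping plus a NaN/Inf guard before each update. A minimal sketch of both, using a NumPy logistic-regression step as a stand-in for the real model (the function name `train_step`, the learning rate, and the clip threshold are all illustrative assumptions, not the production configuration):

```python
import numpy as np

def train_step(w, X, y, lr=0.1, max_norm=1.0):
    """One SGD step on mean log loss with global gradient-norm clipping
    and a finiteness guard that skips the update on NaN/Inf gradients."""
    z = np.clip(X @ w, -30.0, 30.0)       # clip logits for a numerically stable sigmoid
    p = 1.0 / (1.0 + np.exp(-z))
    grad = X.T @ (p - y) / len(y)         # gradient of mean binary log loss
    norm = float(np.linalg.norm(grad))
    if norm > max_norm:                   # rescale to tame 100x norm spikes
        grad = grad * (max_norm / norm)
    if not np.all(np.isfinite(grad)):     # guard: skip the step entirely on NaN/Inf
        return w, norm, False
    return w - lr * grad, norm, True

# Illustrative run: one wildly-scaled column, a typical source of norm spikes.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
X[:, 0] *= 1e4
y = (rng.random(256) < 0.5).astype(float)
w = np.zeros(5)
for _ in range(200):
    w, norm, ok = train_step(w, X, y)
```

With clipping in place the weights stay finite even though the unscaled gradient norm is dominated by the mis-scaled feature; without it, the same loop can overflow within a few steps.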
You are given a tabular dataset built from impression logs for a binary classification task: predict whether a user clicks a recommended item within the session.
| Feature Group | Count | Examples |
|---|---|---|
| Numerical engagement | 18 | session_length_sec, prior_click_rate_7d, messages_in_session, recency_hours |
| Categorical context | 11 | surface, device_type, country, recommendation_type |
| Sparse count features | 9 | prior_impressions_1d, prior_impressions_7d, prior_hides_30d |
| Derived ratios | 6 | clicks_per_impression_7d, messages_per_minute, hide_rate_30d |
| Data quality flags | 4 | missing_profile_flag, cold_start_flag, sparse_history_flag |
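The derived ratios in the table are a common NaN source: a cold-start row with zero impressions turns `clicks_per_impression_7d` into 0/0, and heavy-tailed sparse counts can blow up gradient norms. A minimal preprocessing sketch, assuming NumPy arrays; the helper name `safe_ratio` is hypothetical:

```python
import numpy as np

def safe_ratio(num, den):
    """Guarded division for derived ratios: a zero denominator yields 0.0
    instead of NaN/Inf, so cold-start rows cannot poison the loss."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)
    return out

clicks = np.array([3.0, 0.0, 2.0])
imps   = np.array([10.0, 0.0, 0.0])   # zero impressions for cold-start rows
ratio  = safe_ratio(clicks, imps)     # -> [0.3, 0.0, 0.0], no NaN

counts = np.array([0.0, 5.0, 12000.0])
log_counts = np.log1p(counts)         # compress heavy-tailed sparse counts
```

Pairing the guarded ratio with the existing `cold_start_flag` lets the model distinguish a true zero rate from an undefined one.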
A successful solution should identify the most likely causes of divergence, implement fixes, and restore stable training with validation performance matching or beating the previous benchmark on the held-out test set: ROC-AUC ≥ 0.78 and log loss ≤ 0.21.
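Both benchmark metrics can be checked without any ML framework. A minimal sketch in pure Python: ROC-AUC via the rank (Mann-Whitney) formulation and log loss with probability clipping (the function names and the `eps` value are illustrative choices):

```python
import math

def roc_auc(y_true, scores):
    """Rank-based ROC-AUC: fraction of (positive, negative) pairs the model
    orders correctly, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from
    0 and 1 so a single confident mistake cannot produce inf."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # -> 0.75
ll  = log_loss([0, 1], [0.5, 0.5])                    # -> log(2) ≈ 0.693
```

A restored run would be accepted when `roc_auc(...) >= 0.78 and log_loss(...) <= 0.21` on the held-out test set.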