OpenAI is training a click-through-rate model for in-product recommendation slots shown across ChatGPT surfaces. A recent training run on the latest week of data is diverging: training loss becomes NaN after a few hundred steps, validation AUC drops below the historical baseline, and gradient norms spike to roughly 100x those of prior runs.
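Gradient-norm spikes of this magnitude usually warrant clipping plus a NaN/Inf guard before each update. A minimal sketch of both, using a NumPy logistic-regression step as a stand-in for the real model (the function name `train_step`, the learning rate, and the clip threshold are all illustrative assumptions, not the production configuration):

```python
import numpy as np

def train_step(w, X, y, lr=0.1, max_norm=1.0):
    """One SGD step on mean log loss with global gradient-norm clipping
    and a finiteness guard that skips the update on NaN/Inf gradients."""
    z = np.clip(X @ w, -30.0, 30.0)       # clip logits for a numerically stable sigmoid
    p = 1.0 / (1.0 + np.exp(-z))
    grad = X.T @ (p - y) / len(y)         # gradient of mean binary log loss
    norm = float(np.linalg.norm(grad))
    if norm > max_norm:                   # rescale to tame 100x norm spikes
        grad = grad * (max_norm / norm)
    if not np.all(np.isfinite(grad)):     # guard: skip the step entirely on NaN/Inf
        return w, norm, False
    return w - lr * grad, norm, True

# Illustrative run: one wildly-scaled column, a typical source of norm spikes.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
X[:, 0] *= 1e4
y = (rng.random(256) < 0.5).astype(float)
w = np.zeros(5)
for _ in range(200):
    w, norm, ok = train_step(w, X, y)
```

With clipping in place the weights stay finite even though the unscaled gradient norm is dominated by the mis-scaled feature; without it, the same loop can overflow within a few steps.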
You are given a tabular dataset built from impression logs for a binary classification task: predict whether a user clicks a recommended item within the session.
| Feature Group | Count | Examples |
|---|---|---|
| Numerical engagement | 18 | session_length_sec, prior_click_rate_7d, messages_in_session, recency_hours |
| Categorical context | 11 | surface, device_type, country, recommendation_type |
| Sparse count features | 9 | prior_impressions_1d, prior_impressions_7d, prior_hides_30d |
| Derived ratios | 6 | clicks_per_impression_7d, messages_per_minute, hide_rate_30d |
| Data quality flags | 4 | missing_profile_flag, cold_start_flag, sparse_history_flag |
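The derived ratios in the table are a common NaN source: a cold-start row with zero impressions turns `clicks_per_impression_7d` into 0/0, and heavy-tailed sparse counts can blow up gradient norms. A minimal preprocessing sketch, assuming NumPy arrays; the helper name `safe_ratio` is hypothetical:

```python
import numpy as np

def safe_ratio(num, den):
    """Guarded division for derived ratios: a zero denominator yields 0.0
    instead of NaN/Inf, so cold-start rows cannot poison the loss."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)
    return out

clicks = np.array([3.0, 0.0, 2.0])
imps   = np.array([10.0, 0.0, 0.0])   # zero impressions for cold-start rows
ratio  = safe_ratio(clicks, imps)     # -> [0.3, 0.0, 0.0], no NaN

counts = np.array([0.0, 5.0, 12000.0])
log_counts = np.log1p(counts)         # compress heavy-tailed sparse counts
```

Pairing the guarded ratio with the existing `cold_start_flag` lets the model distinguish a true zero rate from an undefined one.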
A successful solution should identify the most likely causes of divergence, implement fixes, and restore stable training with validation performance matching or beating the previous benchmark on the held-out test set: ROC-AUC ≥ 0.78 and log loss ≤ 0.21.
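Both benchmark metrics can be checked without any ML framework. A minimal sketch in pure Python: ROC-AUC via the rank (Mann-Whitney) formulation and log loss with probability clipping (the function names and the `eps` value are illustrative choices):

```python
import math

def roc_auc(y_true, scores):
    """Rank-based ROC-AUC: fraction of (positive, negative) pairs the model
    orders correctly, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from
    0 and 1 so a single confident mistake cannot produce inf."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # -> 0.75
ll  = log_loss([0, 1], [0.5, 0.5])                    # -> log(2) ≈ 0.693
```

A restored run would be accepted when `roc_auc(...) >= 0.78 and log_loss(...) <= 0.21` on the held-out test set.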