## Business Context
You’re on the Ads Ranking team at a large e-commerce marketplace (≈35M DAUs, ≈120M ad impressions/day). The platform uses a neural CTR model to predict the probability a user clicks an ad; this prediction directly impacts auction outcomes and revenue. Over the last week, revenue dropped ~3% and offline AUC-ROC fell from ~0.78 to ~0.71 after a “minor refactor” that introduced a new embedding feature and migrated the training loop from a high-level framework to a custom PyTorch loop.
Your task is to diagnose and fix the training pipeline. The interviewer is explicitly evaluating whether you understand the role of backpropagation in training neural networks—not as a definition, but as an engineering tool for debugging gradient flow, stability, and correctness.
## Dataset
You have a logged dataset of ad impressions with delayed click labels.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Dense numerical | 18 | price, discount_pct, historical_ctr_7d, time_since_last_purchase | Standardized; some heavy tails |
| Categorical (ID) | 6 | user_id, ad_id, query_id, category_id, device_type, geo | Embedded; high cardinality |
| Contextual | 10 | hour_of_day, day_of_week, session_depth, page_type | Mostly low-cardinality |
| Target | 1 | clicked (0/1) | Click within 24h window |
- Size: ~1.2B impressions over 14 days; training uses a 5% sample (~60M rows) for iteration speed.
- Class balance: ~1.1% positive (clicked), 98.9% negative.
- Missing data: ~8% missing in some dense features (e.g., historical_ctr_7d for new ads); ~2% missing geo.
- Known issue symptoms (from training dashboards):
  - Training loss decreases for ~200 steps, then becomes `nan`.
  - Gradient norm spikes from ~5 to >10,000.
  - Embedding weights for `ad_id` become `inf` within a few minutes.
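Symptoms like these are usually localized by watching gradients flow backward through the network. A minimal PyTorch sketch (the helper name, print format, and norm threshold are illustrative, not part of the original pipeline) that registers backward hooks to flag the first module emitting non-finite or oversized gradients:

```python
import torch
import torch.nn as nn

def attach_grad_monitors(model: nn.Module, norm_threshold: float = 1e4):
    """Register backward hooks on leaf modules that report non-finite
    or unusually large gradients, so the faulty layer can be localized."""
    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is None:
                    continue
                if not torch.isfinite(g).all():
                    print(f"[grad-debug] non-finite grad_output in {name}")
                elif g.norm().item() > norm_threshold:
                    print(f"[grad-debug] grad norm {g.norm().item():.1f} in {name}")
        return hook
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_full_backward_hook(make_hook(name))
```

In this setup, the first module reported during `loss.backward()` is the closest suspect to the fault (here, likely the new `ad_id` embedding or the layer consuming it).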
## Success Criteria
- Restore stable training (no NaNs/Infs) for at least 3 epochs on the 60M-row sample.
- Achieve offline AUC-ROC ≥ 0.77 and AUC-PR ≥ 0.18 on a time-based holdout day.
- Demonstrate a concrete gradient debugging plan that uses backprop artifacts (gradients, Jacobians, hooks) to localize the fault.
- Provide at least two preventative guardrails to catch similar backprop/gradient regressions in CI.
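One such CI guardrail can be a plain unit test asserting gradient health on a single synthetic batch; a hedged sketch (the model shape and synthetic data are stand-ins for the real pipeline, not its actual architecture):

```python
import torch
import torch.nn as nn

def test_gradients_finite():
    """CI guardrail: one forward/backward pass on synthetic data must
    yield a finite loss and finite gradients for every parameter."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(18, 64), nn.ReLU(), nn.Linear(64, 1))
    x = torch.randn(256, 18)
    y = torch.randint(0, 2, (256, 1)).float()
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    assert torch.isfinite(loss), "loss is NaN/Inf"
    for name, p in model.named_parameters():
        assert torch.isfinite(p.grad).all(), f"non-finite grad in {name}"
```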
## Constraints
- Training: 4×A10 GPUs, max 2 hours per experiment; mixed precision (AMP) is preferred for throughput.
- Inference: p99 latency budget 8 ms on CPU for a single request; model must remain an MLP with embeddings (no large transformers).
- Compliance: Must be able to explain why the model changed (feature attribution at a high level); cannot log raw user identifiers.
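Because AMP is preferred under the GPU budget, the training step must order loss scaling, unscaling, and clipping correctly; a minimal sketch (function name, `max_norm` value, and the CPU fallback are assumptions for illustration):

```python
import torch

def train_step(model, optimizer, scaler, x, y, device_type="cuda", max_norm=1.0):
    """One AMP training step with gradient clipping.
    Grads are unscaled before clipping so the threshold applies to true grads."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        logits = model(x)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)           # convert grads back to fp32 scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)               # skips the step if grads are inf/nan
    scaler.update()
    return loss.item()
```

Note the design choice: clipping after `unscale_` is what makes the `max_norm` threshold meaningful, and `scaler.step` silently skipping non-finite updates is itself a stability guard.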
## Deliverables
- Explain, in the context of this CTR model, what backpropagation is computing and how it connects loss → parameter updates.
- A step-by-step debugging plan to identify why gradients explode / become NaN (include what tensors to inspect and where).
- A corrected training approach (loss, optimizer, initialization, normalization, clipping, AMP settings) that stabilizes backprop.
- An evaluation plan (splits, metrics, thresholding) appropriate for highly imbalanced CTR.
- Two production guardrails (unit tests / assertions / monitoring) that specifically validate backprop correctness and gradient health.
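A second guardrail complementing the finite-gradient check is an "overfit one batch" test: if loss cannot drop on a tiny memorizable batch, the loss → gradients → update wiring is broken somewhere. A hedged sketch (synthetic data and a stand-in MLP, not the production model):

```python
import torch
import torch.nn as nn

def test_loss_decreases_one_batch():
    """Guardrail: the model must be able to overfit a single tiny batch;
    a flat loss curve indicates broken backprop wiring, not bad data."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(18, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x = torch.randn(64, 18)
    y = torch.randint(0, 2, (64, 1)).float()

    def step():
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()
        opt.step()
        return loss.item()

    first = step()
    for _ in range(200):
        last = step()
    assert last < first, f"loss did not drop: {first:.3f} -> {last:.3f}"
```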