## Business Context
You’re on the Ads Ranking team at a large e-commerce marketplace (≈35M DAUs, ≈120M ad impressions/day). The platform uses a neural CTR model to predict the probability a user clicks an ad; this prediction directly impacts auction outcomes and revenue. Over the last week, revenue dropped ~3% and offline AUC-ROC fell from ~0.78 to ~0.71 after a “minor refactor” that introduced a new embedding feature and migrated the training loop from a high-level framework to a custom PyTorch loop.
Your task is to diagnose and fix the training pipeline. The interviewer is explicitly evaluating whether you understand the role of backpropagation in training neural networks—not as a definition, but as an engineering tool for debugging gradient flow, stability, and correctness.
## Dataset
You have a logged dataset of ad impressions with delayed click labels.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Dense numerical | 18 | price, discount_pct, historical_ctr_7d, time_since_last_purchase | Standardized; some heavy tails |
| Categorical (ID) | 6 | user_id, ad_id, query_id, category_id, device_type, geo | Embedded; high cardinality |
| Contextual | 10 | hour_of_day, day_of_week, session_depth, page_type | Mostly low-cardinality |
| Target | 1 | clicked (0/1) | Click within 24h window |
- Size: ~1.2B impressions over 14 days; training uses a 5% sample (~60M rows) for iteration speed.
- Class balance: ~1.1% positive (clicked), 98.9% negative.
- Missing data: ~8% missing in some dense features (e.g., historical_ctr_7d for new ads); ~2% missing geo.
- Known issue symptoms (from training dashboards):
  - Training loss decreases for ~200 steps, then becomes `nan`.
  - Gradient norm spikes from ~5 to >10,000.
  - Embedding weights for `ad_id` become `inf` within a few minutes.
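Symptoms like these are usually localized by watching gradients flow backward through the network. A minimal PyTorch sketch (the helper name, print format, and norm threshold are illustrative, not part of the original pipeline) that registers backward hooks to flag the first module emitting non-finite or oversized gradients:

```python
import torch
import torch.nn as nn

def attach_grad_monitors(model: nn.Module, norm_threshold: float = 1e4):
    """Register backward hooks on leaf modules that report non-finite
    or unusually large gradients, so the faulty layer can be localized."""
    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is None:
                    continue
                if not torch.isfinite(g).all():
                    print(f"[grad-debug] non-finite grad_output in {name}")
                elif g.norm().item() > norm_threshold:
                    print(f"[grad-debug] grad norm {g.norm().item():.1f} in {name}")
        return hook
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_full_backward_hook(make_hook(name))
```

In this setup, the first module reported during `loss.backward()` is the closest suspect to the fault (here, likely the new `ad_id` embedding or the layer consuming it).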
## Success Criteria
- Restore stable training (no NaNs/Infs) for at least 3 epochs on the 60M-row sample.
- Achieve offline AUC-ROC ≥ 0.77 and AUC-PR ≥ 0.18 on a time-based holdout day.
- Demonstrate a concrete gradient debugging plan that uses backprop artifacts (gradients, Jacobians, hooks) to localize the fault.
- Provide at least two preventative guardrails to catch similar backprop/gradient regressions in CI.
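One such CI guardrail can be a plain unit test asserting gradient health on a single synthetic batch; a hedged sketch (the model shape and synthetic data are stand-ins for the real pipeline, not its actual architecture):

```python
import torch
import torch.nn as nn

def test_gradients_finite():
    """CI guardrail: one forward/backward pass on synthetic data must
    yield a finite loss and finite gradients for every parameter."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(18, 64), nn.ReLU(), nn.Linear(64, 1))
    x = torch.randn(256, 18)
    y = torch.randint(0, 2, (256, 1)).float()
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    assert torch.isfinite(loss), "loss is NaN/Inf"
    for name, p in model.named_parameters():
        assert torch.isfinite(p.grad).all(), f"non-finite grad in {name}"
```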
## Constraints
- Training: 4×A10 GPUs, max 2 hours per experiment; mixed precision (AMP) is preferred for throughput.
- Inference: p99 latency budget 8 ms on CPU for a single request; model must remain an MLP with embeddings (no large transformers).
- Compliance: Must be able to explain why the model changed (feature attribution at a high level); cannot log raw user identifiers.
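Because AMP is preferred under the GPU budget, the training step must order loss scaling, unscaling, and clipping correctly; a minimal sketch (function name, `max_norm` value, and the CPU fallback are assumptions for illustration):

```python
import torch

def train_step(model, optimizer, scaler, x, y, device_type="cuda", max_norm=1.0):
    """One AMP training step with gradient clipping.
    Grads are unscaled before clipping so the threshold applies to true grads."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        logits = model(x)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)           # convert grads back to fp32 scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)               # skips the step if grads are inf/nan
    scaler.update()
    return loss.item()
```

Note the design choice: clipping after `unscale_` is what makes the `max_norm` threshold meaningful, and `scaler.step` silently skipping non-finite updates is itself a stability guard.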
## Deliverables
- Explain, in the context of this CTR model, what backpropagation is computing and how it connects loss → parameter updates.
- A step-by-step debugging plan to identify why gradients explode / become NaN (include what tensors to inspect and where).
- A corrected training approach (loss, optimizer, initialization, normalization, clipping, AMP settings) that stabilizes backprop.
- An evaluation plan (splits, metrics, thresholding) appropriate for highly imbalanced CTR.
- Two production guardrails (unit tests / assertions / monitoring) that specifically validate backprop correctness and gradient health.
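A second guardrail complementing the finite-gradient check is an "overfit one batch" test: if loss cannot drop on a tiny memorizable batch, the loss → gradients → update wiring is broken somewhere. A hedged sketch (synthetic data and a stand-in MLP, not the production model):

```python
import torch
import torch.nn as nn

def test_loss_decreases_one_batch():
    """Guardrail: the model must be able to overfit a single tiny batch;
    a flat loss curve indicates broken backprop wiring, not bad data."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(18, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x = torch.randn(64, 18)
    y = torch.randint(0, 2, (64, 1)).float()

    def step():
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()
        opt.step()
        return loss.item()

    first = step()
    for _ in range(200):
        last = step()
    assert last < first, f"loss did not drop: {first:.3f} -> {last:.3f}"
```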