Business Context
You’re on the Computer Vision team at ShopNow, a global e-commerce marketplace with 35M DAUs. A core revenue lever is visual search: users take a photo of an item and the app retrieves similar products. The current model is a deep CNN that produces image embeddings and classifies each image (category + attributes). Over the last quarter, the team tried to increase model depth to improve retrieval quality, but training became unstable: the loss plateaus early, the earliest layers barely learn, and the deeper model underperforms a shallower baseline.
You’re asked to explain the vanishing gradient problem in this context and propose a production-ready redesign using ResNet-style residual connections that trains reliably and meets mobile inference constraints.
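For concreteness, a minimal sketch of the kind of residual block such a redesign would build on (PyTorch; this follows the standard ResNet basic-block structure, but the class name and the 1×1 projection shortcut here are illustrative, not part of the spec):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block: out = ReLU(F(x) + shortcut(x)).

    The identity shortcut gives gradients a direct path back to earlier
    layers, which is the core fix for the symptoms described above.
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut only when the shape changes; identity otherwise.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # residual addition
```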
Dataset
You have an internal dataset of product images and labels.
| Component | Details |
|---|---|
| Scale | 12.4M images, 224×224 RGB, ~3.1TB on S3 |
| Labels | 1,200 categories (long-tail), plus 8 binary attributes (e.g., “has_logo”, “striped”) |
| Split | Time-based: train (last 9 months), val (next 1 month), test (most recent 2 weeks) |
| Class balance | Head classes ~200K images; tail classes <200 images; effective imbalance ~1000:1 |
| Noise | ~2–4% label noise from seller-provided metadata |
| Missingness | 15% of attribute labels missing (unknown), category always present |
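The 15% attribute missingness suggests masking unknown labels out of the attribute loss rather than imputing them. A minimal sketch, assuming missing labels are encoded as -1 in an (N, 8) label tensor (the -1 encoding is an assumption, not part of the spec):

```python
import torch
import torch.nn.functional as F

def masked_attribute_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCE over the 8 binary attributes, ignoring missing labels.

    logits: (N, 8) raw scores; labels: (N, 8) in {0, 1, -1},
    where -1 = unknown (assumed encoding).
    """
    mask = (labels >= 0).float()            # 1 where the label is known
    targets = labels.clamp(min=0).float()   # maps -1 -> 0; masked out anyway
    per_elem = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Average only over known labels; clamp guards the all-missing edge case.
    return (per_elem * mask).sum() / mask.sum().clamp(min=1.0)
```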
Success Criteria
- Training stability: no early loss plateau; gradients in early layers remain non-trivial (define how you’d measure this).
- Quality (on test):
  - Category top-1 accuracy ≥ 72% (baseline deep plain CNN is ~66%; shallow CNN is ~69%).
  - Category top-5 accuracy ≥ 90%.
  - Macro-F1 on the 8 attributes ≥ 0.62, ignoring missing labels (see the masked-metric sketch after this list).
- Production constraints:
  - On-device inference (mid-tier Android): p95 latency ≤ 45 ms, model size ≤ 35 MB.
  - Training budget: 8×A100 GPUs, max 24 hours for a full training run.
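For the attribute metric in particular, "ignoring missing labels" needs to be explicit in the evaluation code. A minimal sketch using scikit-learn, again assuming -1 encodes an unknown label:

```python
import numpy as np
from sklearn.metrics import f1_score

def attribute_macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-F1 over the 8 binary attributes, skipping missing labels per attribute.

    y_true: (N, 8) in {0, 1, -1} with -1 = missing (assumed encoding).
    y_pred: (N, 8) in {0, 1}.
    """
    per_attr = []
    for j in range(y_true.shape[1]):
        known = y_true[:, j] >= 0
        if known.sum() == 0:
            continue  # no ground truth for this attribute in the split
        per_attr.append(f1_score(y_true[known, j], y_pred[known, j], zero_division=0))
    return float(np.mean(per_attr)) if per_attr else 0.0
```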
Constraints / Requirements
- You must avoid data leakage (time split is mandatory).
- You must handle long-tail imbalance without exploding false positives on head classes.
- You must explain, concretely, why deeper plain networks fail here and why residual connections help.
- You must propose how you’d monitor gradient health during training and what thresholds/alerts you’d set (one instrumentation sketch follows this list).
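One concrete way to instrument gradient health after each `loss.backward()` (the early/late proxy and the ratio threshold below are illustrative assumptions to be tuned; the requirement only says to monitor and alert):

```python
import torch
import torch.nn as nn

def gradient_health(model: nn.Module, ratio_floor: float = 1e-3) -> dict:
    """Per-parameter gradient L2 norms plus an early/late ratio for alerting.

    Call after loss.backward(). A sustained ratio below `ratio_floor`
    (an assumed threshold) is a vanishing-gradient signal worth alerting on.
    """
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    ordered = list(norms.values())
    # Compare the first and last quarters of parameters in registration order,
    # a cheap proxy for "early" vs "late" layers in a sequential backbone.
    k = max(1, len(ordered) // 4)
    early = sum(ordered[:k]) / k
    late = sum(ordered[-k:]) / k
    ratio = early / max(late, 1e-12)
    return {"per_layer": norms, "early_late_ratio": ratio, "alert": ratio < ratio_floor}
```

Logging this ratio per step (and alerting on a sustained drop rather than a single noisy step) makes "gradients in early layers remain non-trivial" a measurable criterion rather than a qualitative one.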
Deliverables
- A clear explanation of vanishing gradients in deep networks (in terms of backprop Jacobians, activation functions, and normalization); a worked form of the Jacobian argument appears after this list.
- A ResNet-based architecture proposal that fits the latency/size budget (e.g., ResNet-18/34 variant, bottlenecks, width multipliers).
- A training plan: optimizer, LR schedule, normalization, regularization, and imbalance handling.
- An evaluation plan with metrics (including long-tail-aware metrics) and acceptance thresholds.
- A brief production plan: export format, quantization strategy, and how you’d validate no regression after quantization.
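For the first deliverable, the backbone of the explanation is the product of layer Jacobians in backprop. A worked form of the argument, in standard notation (the symbols $h_\ell$, $\mathcal{L}$, $F$ are generic, not tied to this model):

```latex
% Gradient w.r.t. an early layer's activations is a product of layer Jacobians:
\[
\frac{\partial \mathcal{L}}{\partial h_\ell}
  = \left( \prod_{k=\ell+1}^{L} \frac{\partial h_k}{\partial h_{k-1}} \right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial h_L},
\qquad
\left\| \frac{\partial \mathcal{L}}{\partial h_\ell} \right\|
  \le \left( \prod_{k=\ell+1}^{L}
      \sigma_{\max}\!\left(\frac{\partial h_k}{\partial h_{k-1}}\right) \right)
      \left\| \frac{\partial \mathcal{L}}{\partial h_L} \right\|.
\]
% If the per-layer singular values sit below 1 (saturating activations,
% poorly scaled weights, weak normalization), the bound decays exponentially
% in the depth L - \ell, so early layers receive near-zero updates.
% A residual block h_k = h_{k-1} + F(h_{k-1}) changes each Jacobian factor to
\[
\frac{\partial h_k}{\partial h_{k-1}}
  = I + \frac{\partial F(h_{k-1})}{\partial h_{k-1}},
\]
% so the identity term preserves a direct gradient path and the product no
% longer collapses toward zero purely as a function of depth.
```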