Business Context
You’re on the Computer Vision team at ShopNow, a global e-commerce marketplace with 35M DAUs. A core revenue lever is visual search: users take a photo of an item and the app retrieves similar products. The current model is a deep CNN that produces image embeddings and classifies each image (category + attributes). Over the last quarter, the team tried to increase model depth to improve retrieval quality, but training became unstable: the loss plateaus early, the earliest layers barely learn, and the deeper model underperforms a shallower baseline.
You’re asked to explain the vanishing gradient problem in this context and propose a production-ready redesign using ResNet-style residual connections that trains reliably and meets mobile inference constraints.
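For concreteness, a minimal sketch of the kind of residual block such a redesign would build on (PyTorch; this follows the standard ResNet basic-block structure, but the class name and the 1×1 projection shortcut here are illustrative, not part of the spec):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block: out = ReLU(F(x) + shortcut(x)).

    The identity shortcut gives gradients a direct path back to earlier
    layers, which is the core fix for the symptoms described above.
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut only when the shape changes; identity otherwise.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # residual addition
```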
Dataset
You have an internal dataset of product images and labels.
| Component | Details |
|---|---|
| Scale | 12.4M images, 224×224 RGB, ~3.1TB on S3 |
| Labels | 1,200 categories (long-tail), plus 8 binary attributes (e.g., “has_logo”, “striped”) |
| Split | Time-based: train (last 9 months), val (next 1 month), test (most recent 2 weeks) |
| Class balance | Head classes ~200K images; tail classes <200 images; effective imbalance ~1000:1 |
| Noise | ~2–4% label noise from seller-provided metadata |
| Missingness | 15% of attribute labels missing (unknown), category always present |
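The 15% attribute missingness suggests masking unknown labels out of the attribute loss rather than imputing them. A minimal sketch, assuming missing labels are encoded as -1 in an (N, 8) label tensor (the -1 encoding is an assumption, not part of the spec):

```python
import torch
import torch.nn.functional as F

def masked_attribute_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCE over the 8 binary attributes, ignoring missing labels.

    logits: (N, 8) raw scores; labels: (N, 8) in {0, 1, -1},
    where -1 = unknown (assumed encoding).
    """
    mask = (labels >= 0).float()            # 1 where the label is known
    targets = labels.clamp(min=0).float()   # maps -1 -> 0; masked out anyway
    per_elem = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Average only over known labels; clamp guards the all-missing edge case.
    return (per_elem * mask).sum() / mask.sum().clamp(min=1.0)
```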
Success Criteria
- Training stability: no early loss plateau; gradients in early layers remain non-trivial (define how you’d measure this).
- Quality (on test):
  - Category top-1 accuracy ≥ 72% (baseline deep plain CNN is ~66%; shallow CNN is ~69%).
  - Category top-5 accuracy ≥ 90%.
  - Macro-F1 on the 8 attributes ≥ 0.62, ignoring missing labels (see the masked-metric sketch after this list).
- Production constraints:
  - On-device inference (mid-tier Android): p95 latency ≤ 45 ms, model size ≤ 35 MB.
  - Training budget: 8×A100 GPUs, max 24 hours for a full training run.
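For the attribute metric in particular, "ignoring missing labels" needs to be explicit in the evaluation code. A minimal sketch using scikit-learn, again assuming -1 encodes an unknown label:

```python
import numpy as np
from sklearn.metrics import f1_score

def attribute_macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-F1 over the 8 binary attributes, skipping missing labels per attribute.

    y_true: (N, 8) in {0, 1, -1} with -1 = missing (assumed encoding).
    y_pred: (N, 8) in {0, 1}.
    """
    per_attr = []
    for j in range(y_true.shape[1]):
        known = y_true[:, j] >= 0
        if known.sum() == 0:
            continue  # no ground truth for this attribute in the split
        per_attr.append(f1_score(y_true[known, j], y_pred[known, j], zero_division=0))
    return float(np.mean(per_attr)) if per_attr else 0.0
```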
Constraints / Requirements
- You must avoid data leakage (time split is mandatory).
- You must handle long-tail imbalance without exploding false positives on head classes.
- You must explain, concretely, why deeper plain networks fail here and why residual connections help.
- You must propose how you’d monitor gradient health during training and what thresholds/alerts you’d set (one instrumentation sketch follows this list).
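One concrete way to instrument gradient health after each `loss.backward()` (the early/late proxy and the ratio threshold below are illustrative assumptions to be tuned; the requirement only says to monitor and alert):

```python
import torch
import torch.nn as nn

def gradient_health(model: nn.Module, ratio_floor: float = 1e-3) -> dict:
    """Per-parameter gradient L2 norms plus an early/late ratio for alerting.

    Call after loss.backward(). A sustained ratio below `ratio_floor`
    (an assumed threshold) is a vanishing-gradient signal worth alerting on.
    """
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    ordered = list(norms.values())
    # Compare the first and last quarters of parameters in registration order,
    # a cheap proxy for "early" vs "late" layers in a sequential backbone.
    k = max(1, len(ordered) // 4)
    early = sum(ordered[:k]) / k
    late = sum(ordered[-k:]) / k
    ratio = early / max(late, 1e-12)
    return {"per_layer": norms, "early_late_ratio": ratio, "alert": ratio < ratio_floor}
```

Logging this ratio per step (and alerting on a sustained drop rather than a single noisy step) makes "gradients in early layers remain non-trivial" a measurable criterion rather than a qualitative one.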
Deliverables
- A clear explanation of vanishing gradients in deep networks (in terms of backprop Jacobians, activation functions, and normalization); a worked form of the Jacobian argument appears after this list.
- A ResNet-based architecture proposal that fits the latency/size budget (e.g., ResNet-18/34 variant, bottlenecks, width multipliers).
- A training plan: optimizer, LR schedule, normalization, regularization, and imbalance handling.
- An evaluation plan with metrics (including long-tail-aware metrics) and acceptance thresholds.
- A brief production plan: export format, quantization strategy, and how you’d validate no regression after quantization.
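For the first deliverable, the backbone of the explanation is the product of layer Jacobians in backprop. A worked form of the argument, in standard notation (the symbols $h_\ell$, $\mathcal{L}$, $F$ are generic, not tied to this model):

```latex
% Gradient w.r.t. an early layer's activations is a product of layer Jacobians:
\[
\frac{\partial \mathcal{L}}{\partial h_\ell}
  = \left( \prod_{k=\ell+1}^{L} \frac{\partial h_k}{\partial h_{k-1}} \right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial h_L},
\qquad
\left\| \frac{\partial \mathcal{L}}{\partial h_\ell} \right\|
  \le \left( \prod_{k=\ell+1}^{L}
      \sigma_{\max}\!\left(\frac{\partial h_k}{\partial h_{k-1}}\right) \right)
      \left\| \frac{\partial \mathcal{L}}{\partial h_L} \right\|.
\]
% If the per-layer singular values sit below 1 (saturating activations,
% poorly scaled weights, weak normalization), the bound decays exponentially
% in the depth L - \ell, so early layers receive near-zero updates.
% A residual block h_k = h_{k-1} + F(h_{k-1}) changes each Jacobian factor to
\[
\frac{\partial h_k}{\partial h_{k-1}}
  = I + \frac{\partial F(h_{k-1})}{\partial h_{k-1}},
\]
% so the identity term preserves a direct gradient path and the product no
% longer collapses toward zero purely as a function of depth.
```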