Business Context
ShopLens trains image classification models to detect catalog quality issues before products go live. The ML platform team wants to maximize GPU throughput during training on a fixed fleet of 24 GB NVIDIA GPUs without triggering out-of-memory failures.
Dataset
You are given a product image classification dataset used for offline model training.
| Feature Group | Count | Examples |
|---|---|---|
| Images | 1.2M | 224x224 RGB product photos from mobile and studio uploads |
| Labels | 12 classes | blurry, duplicate, low_light, background_issue, watermark, good_image |
| Metadata | 6 | source_channel, device_type, upload_region, aspect_ratio |
| Splits | 3 | train, validation, test |
- Size: 1.2M images, 12 classes
- Target: Multiclass classification of image quality category
- Class balance: Moderately imbalanced; largest class 41%, smallest class 3.5%
- Missing data: ~2% corrupted image files; metadata missing in 8% of records
Success Criteria
A good solution should:
- Find the largest stable batch size per GPU that avoids OOM errors
- Improve GPU utilization and images/second throughput versus a conservative baseline batch size of 32
- Maintain validation macro-F1 within 1 point of the baseline training configuration
- Provide a reproducible method that can be reused across models and hardware types
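One common way to meet the first criterion is a monotone search: if a batch size fits in memory, every smaller one does too, so the largest safe value can be found by binary search over a probe function that attempts one training step and reports whether it triggered an OOM error. A minimal sketch, where `fits` is a hypothetical caller-supplied predicate (e.g. one forward/backward pass wrapped in a try/except on the OOM exception):

```python
def max_safe_batch(fits, lo=1, hi=1024):
    """Binary-search the largest batch size in [lo, hi] for which fits(b)
    is True. Assumes fits is monotone: if batch b fits, every b' < b fits."""
    if not fits(lo):
        raise ValueError(f"even batch size {lo} does not fit in memory")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if fits(mid):
            lo = mid   # mid fits; the answer is mid or larger
        else:
            hi = mid - 1  # mid OOMs; the answer is strictly smaller
    return lo
```

In production the probe should run a few steps (memory can grow after the first step due to optimizer state and caching allocators), and the returned value is usually scaled down by a safety margin, e.g. 10%, before use.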
Constraints
- Training runs on a single 24 GB GPU in the interview environment
- Candidate should account for mixed precision, gradient accumulation, and variable memory usage across augmentations
- The approach must be robust enough for production training jobs, not just a one-off manual guess
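Gradient accumulation interacts directly with the batch-size search: if the largest per-GPU micro-batch is smaller than the effective batch size the optimizer was tuned for, the gap is closed by accumulating gradients over several micro-batches before each optimizer step. A small helper for that arithmetic (illustrative names, not part of any given API):

```python
import math

def accumulation_steps(target_effective_batch, micro_batch):
    """Smallest number of gradient-accumulation steps so that
    micro_batch * steps >= target_effective_batch."""
    return math.ceil(target_effective_batch / micro_batch)

# e.g. to preserve an effective batch of 256 when only 48 images
# fit per GPU step, accumulate over 6 micro-batches
```

Note that mixed precision shifts the search result as well: fp16/bf16 activations roughly halve activation memory, so the maximum safe batch size found under mixed precision is not valid for a full-precision run and vice versa; the search should be re-run per precision setting.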
Deliverables
- Design a method to automatically search for the maximum safe batch size.
- Train and evaluate a baseline and optimized configuration.
- Report throughput, peak GPU memory, training stability, and validation quality.
- Explain tradeoffs between larger batch sizes, convergence behavior, and hardware efficiency.
- Show how you would productionize this into a reusable training utility.
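For the reporting deliverable, throughput and speedup fall out of wall-clock measurements of a fixed number of steps. A minimal sketch of the comparison, assuming the caller has already timed both configurations (peak GPU memory would come from the framework's own counters, e.g. a CUDA allocator statistic, and is omitted here):

```python
def throughput_report(batch_size, steps, elapsed_s, baseline_ips):
    """Images/second for a measured run, and speedup versus a baseline
    throughput (also in images/second)."""
    ips = batch_size * steps / elapsed_s
    return {"images_per_sec": ips, "speedup_vs_baseline": ips / baseline_ips}

# e.g. 100 steps at batch 128 in 10 s -> 1280 img/s; if the batch-32
# baseline measured 640 img/s, that is a 2x speedup
```

Reporting speedup alongside validation macro-F1 for both configurations makes the throughput/quality tradeoff in the success criteria explicit.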