Business Context
ShopLens trains image classification models to detect catalog quality issues before products go live. The ML platform team wants to maximize GPU throughput during training on a fixed fleet of 24 GB NVIDIA GPUs without triggering out-of-memory failures.
Dataset
You are given a product image classification dataset used for offline model training.
| Feature Group | Count | Examples |
|---|---|---|
| Images | 1.2M | 224x224 RGB product photos from mobile and studio uploads |
| Labels | 12 classes | blurry, duplicate, low_light, background_issue, watermark, good_image |
| Metadata | 6 | source_channel, device_type, upload_region, aspect_ratio |
| Splits | 3 | train, validation, test |
- Size: 1.2M images, 12 classes
- Target: Multiclass classification of image quality category
- Class balance: Moderately imbalanced; largest class 41%, smallest class 3.5%
- Missing data: ~2% corrupted image files; metadata missing in 8% of records
Success Criteria
A good solution should:
- Find the largest stable batch size per GPU that avoids OOM errors
- Improve GPU utilization and images/second throughput versus a conservative baseline batch size of 32
- Maintain validation macro-F1 within 1 point of the baseline training configuration
- Provide a reproducible method that can be reused across models and hardware types
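One common way to meet the first criterion is a monotone search: if a batch size fits in memory, every smaller one does too, so the largest safe value can be found by binary search over a probe function that attempts one training step and reports whether it triggered an OOM error. A minimal sketch, where `fits` is a hypothetical caller-supplied predicate (e.g. one forward/backward pass wrapped in a try/except on the OOM exception):

```python
def max_safe_batch(fits, lo=1, hi=1024):
    """Binary-search the largest batch size in [lo, hi] for which fits(b)
    is True. Assumes fits is monotone: if batch b fits, every b' < b fits."""
    if not fits(lo):
        raise ValueError(f"even batch size {lo} does not fit in memory")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if fits(mid):
            lo = mid   # mid fits; the answer is mid or larger
        else:
            hi = mid - 1  # mid OOMs; the answer is strictly smaller
    return lo
```

In production the probe should run a few steps (memory can grow after the first step due to optimizer state and caching allocators), and the returned value is usually scaled down by a safety margin, e.g. 10%, before use.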
Constraints
- Training runs on a single 24 GB GPU in the interview environment
- Candidate should account for mixed precision, gradient accumulation, and variable memory usage across augmentations
- The approach must be robust enough for production training jobs, not just a one-off manual guess
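Gradient accumulation interacts directly with the batch-size search: if the largest per-GPU micro-batch is smaller than the effective batch size the optimizer was tuned for, the gap is closed by accumulating gradients over several micro-batches before each optimizer step. A small helper for that arithmetic (illustrative names, not part of any given API):

```python
import math

def accumulation_steps(target_effective_batch, micro_batch):
    """Smallest number of gradient-accumulation steps so that
    micro_batch * steps >= target_effective_batch."""
    return math.ceil(target_effective_batch / micro_batch)

# e.g. to preserve an effective batch of 256 when only 48 images
# fit per GPU step, accumulate over 6 micro-batches
```

Note that mixed precision shifts the search result as well: fp16/bf16 activations roughly halve activation memory, so the maximum safe batch size found under mixed precision is not valid for a full-precision run and vice versa; the search should be re-run per precision setting.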
Deliverables
- Design a method to automatically search for the maximum safe batch size.
- Train and evaluate a baseline and optimized configuration.
- Report throughput, peak GPU memory, training stability, and validation quality.
- Explain tradeoffs between larger batch sizes, convergence behavior, and hardware efficiency.
- Show how you would productionize this into a reusable training utility.
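For the reporting deliverable, throughput and speedup fall out of wall-clock measurements of a fixed number of steps. A minimal sketch of the comparison, assuming the caller has already timed both configurations (peak GPU memory would come from the framework's own counters, e.g. a CUDA allocator statistic, and is omitted here):

```python
def throughput_report(batch_size, steps, elapsed_s, baseline_ips):
    """Images/second for a measured run, and speedup versus a baseline
    throughput (also in images/second)."""
    ips = batch_size * steps / elapsed_s
    return {"images_per_sec": ips, "speedup_vs_baseline": ips / baseline_ips}

# e.g. 100 steps at batch 128 in 10 s -> 1280 img/s; if the batch-32
# baseline measured 640 img/s, that is a 2x speedup
```

Reporting speedup alongside validation macro-F1 for both configurations makes the throughput/quality tradeoff in the success criteria explicit.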