Business Context
FulfillFast operates 12 automated warehouses and wants an object detection model to identify pallets, forklifts, workers, and damaged boxes from ceiling-mounted cameras. The model will support safety alerts and inventory monitoring, so both detection quality and inference speed matter.
Dataset
You are given a labeled object detection dataset collected from warehouse cameras over 6 months.
| Data component | Count | Details |
|---|---|---|
| Images | 120,000 | 1280x720 RGB frames from 48 cameras |
| Classes | 4 | pallet, forklift, worker, damaged_box |
| Bounding boxes | 410,000 | x_min, y_min, x_max, y_max, class_id |
| Metadata fields | 6 | camera_id, timestamp, warehouse_id, lighting_condition, shift, weather |
- Size: 120K images, ~410K annotated boxes
- Target: Detect and localize all objects in each image
- Class balance: Imbalanced — pallets 52%, forklifts 21%, workers 19%, damaged_box 8%
- Missing data: ~3% of images have incomplete metadata; labels contain occasional noisy boxes from manual annotation
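Any loader depends on the annotation format, which the brief does not specify. As one hypothetical starting point, a minimal PyTorch `Dataset` assuming one JSON record per image with the box fields listed above; the file layout and field names here are illustrative, not given:

```python
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset

# Class-id mapping is an assumption; the brief only names the four classes.
CLASS_TO_ID = {"pallet": 0, "forklift": 1, "worker": 2, "damaged_box": 3}

class WarehouseDetectionDataset(Dataset):
    """Yields (image, target) pairs with xyxy boxes and integer labels.

    Assumes a JSON-lines annotation file where each record looks like:
    {"file": "cam07/000123.jpg", "camera_id": "cam07",
     "boxes": [[x_min, y_min, x_max, y_max], ...], "labels": ["pallet", ...]}
    Adapt the parsing to the real annotation files.
    """

    def __init__(self, annotation_file, image_root, transform=None):
        lines = Path(annotation_file).read_text().splitlines()
        self.records = [json.loads(line) for line in lines]
        self.image_root = Path(image_root)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.image_root / rec["file"]).convert("RGB")
        boxes = torch.tensor(rec["boxes"], dtype=torch.float32)  # (N, 4), xyxy
        labels = torch.tensor([CLASS_TO_ID[c] for c in rec["labels"]],
                              dtype=torch.int64)
        if self.transform is not None:
            image = self.transform(image)
        return image, {"boxes": boxes, "labels": labels}
```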
Success Criteria
A good solution should achieve mAP@0.5 >= 0.78 overall, damaged_box AP >= 0.60, and single-image inference latency under 60 ms on a T4 GPU. The approach should also explain when YOLO is preferable to Faster R-CNN and when it is not.
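The 60 ms budget only means something under a fixed measurement protocol. A minimal latency-check sketch: `model` is a placeholder for whatever detector is chosen, the 1280x720 input matches the raw frames (most detectors resize internally), and the warmup and iteration counts are assumptions:

```python
import torch

@torch.no_grad()
def measure_latency_ms(model, input_shape=(1, 3, 720, 1280),
                       warmup=20, iters=100):
    """Median single-image GPU latency in milliseconds via CUDA events."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    for _ in range(warmup):           # warm up kernels / cuDNN autotuning
        model(x)
    torch.cuda.synchronize()

    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(iters):
        start.record()
        model(x)
        end.record()
        torch.cuda.synchronize()      # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))  # milliseconds
    times.sort()
    return times[len(times) // 2]     # median is more stable than mean
```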
Constraints
- Near-real-time inference for live camera feeds
- Limited GPU budget for training and deployment
- Small-object detection for damaged boxes is important
- The safety team needs per-class error analysis, not just one aggregate metric (a per-class metric sketch follows this list)
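For the per-class requirement, torchmetrics' `MeanAveragePrecision` can report AP per class (the API is real, though adopting this library is a choice, and its default backend needs pycocotools installed). A sketch with toy tensors standing in for real model outputs:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

CLASS_NAMES = ["pallet", "forklift", "worker", "damaged_box"]

# Restrict to IoU=0.5 to match the mAP@0.5 target; class_metrics=True
# exposes the per-class breakdown the safety team asked for.
metric = MeanAveragePrecision(iou_thresholds=[0.5], class_metrics=True)

# One dict per image, in torchmetrics' expected format (toy values).
preds = [{
    "boxes": torch.tensor([[50.0, 60.0, 200.0, 300.0]]),  # xyxy
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([3]),                           # damaged_box
}]
targets = [{
    "boxes": torch.tensor([[48.0, 55.0, 205.0, 310.0]]),
    "labels": torch.tensor([3]),
}]

metric.update(preds, targets)
result = metric.compute()
print(f"mAP@0.5 overall: {result['map_50']:.3f}")
for cls_idx, ap in zip(result["classes"].tolist(),
                       result["map_per_class"].tolist()):
    print(f"AP@0.5 {CLASS_NAMES[cls_idx]}: {ap:.3f}")
```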
Deliverables
- Propose an object detection architecture and justify the choice against at least one alternative (for example, YOLO vs Faster R-CNN).
- Build a training and evaluation pipeline for the dataset.
- Handle class imbalance, annotation noise, and train/validation/test splitting without leakage across cameras or time (splitting and sampling sketches follow this list).
- Report detection metrics by class and discuss production tradeoffs.
- Provide deployment recommendations for batch retraining and online inference.
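For the leakage deliverable, one common approach is to group the split by `camera_id` so no camera appears in both train and evaluation sets; a sketch using scikit-learn's `GroupShuffleSplit` (temporal leakage still needs a separate time-based holdout, e.g. reserving the final weeks of timestamps):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_camera(camera_ids, test_size=0.2, seed=42):
    """Split image indices so no camera leaks across train and eval.

    camera_ids: array-like of length n_images, e.g. ["cam07", "cam12", ...].
    Returns (train_idx, eval_idx) as index arrays into the image list.
    """
    camera_ids = np.asarray(camera_ids)
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    dummy_X = np.zeros(len(camera_ids))   # features are unused by the splitter
    train_idx, eval_idx = next(splitter.split(dummy_X, groups=camera_ids))
    # Sanity check: no camera on both sides of the split.
    assert not set(camera_ids[train_idx]) & set(camera_ids[eval_idx])
    return train_idx, eval_idx
```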
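For the class-imbalance deliverable, one option among several (alongside focal loss or class-weighted losses) is oversampling images that contain the rare `damaged_box` class at the data-loader level; the boost factor below is an illustrative guess, not a tuned value:

```python
from torch.utils.data import WeightedRandomSampler

def build_rare_class_sampler(image_labels, rare_class_id=3, boost=4.0):
    """Oversample images containing the rare class (damaged_box, id 3).

    image_labels: list of per-image class-id lists, e.g. [[0, 2], [3], ...].
    boost: sampling-weight multiplier for images with the rare class;
           4.0 is an illustrative starting point to be tuned on validation.
    """
    weights = [boost if rare_class_id in labels else 1.0
               for labels in image_labels]
    return WeightedRandomSampler(weights, num_samples=len(weights),
                                 replacement=True)

# Usage sketch: pass the sampler to the DataLoader instead of shuffle=True.
# loader = DataLoader(dataset, batch_size=16,
#                     sampler=build_rare_class_sampler(image_labels))
```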