You’re on the perception team at MetroDrive, a ride-hailing and autonomous delivery company operating in 6 major US cities. Your vehicles run an on-device camera model that classifies nearby vehicles into categories used by downstream planning: {car, truck, bus, motorcycle, bicycle, emergency_vehicle}. Missing an emergency vehicle (ambulance, fire truck, police car) is a safety-critical failure: the planner may not yield or may choose an unsafe maneuver. However, emergency vehicles are rare in the training data (a classic long-tail problem), so the model rarely sees them during training.
The current model performs well on common classes but has poor recall on emergency_vehicle, especially at night, in rain, and when lights are partially occluded.
You have a curated dataset from the last 90 days of fleet driving.
| Component | Scale / Details |
|---|---|
| Raw frames | 220M frames (30 FPS video), sampled to 1 FPS for labeling candidates |
| Labeled images | 3.2M images with a primary vehicle label (single-label classification) |
| Classes | car, truck, bus, motorcycle, bicycle, emergency_vehicle |
| Class distribution | car 71%, truck 18%, bus 4.5%, motorcycle 3.2%, bicycle 3.0%, emergency_vehicle 0.3% |
| Features available | image pixels; metadata: city, time_of_day, weather, camera_id, speed, road_type |
| Label noise | ~1–2% overall; higher for emergency vehicles due to ambiguity (e.g., tow trucks with lights) |
| Deployment | On-device (edge GPU), max 15 ms per frame end-to-end for this classifier |
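To get a feel for how severe the imbalance above is, one common starting point is inverse-frequency class weights for the training loss. The snippet below derives them from the stated distribution; the normalization (weights averaging to 1.0) is an illustrative convention, not part of the scenario.

```python
# Inverse-frequency class weights from the class distribution in the table.
# Normalized so the weights average to 1.0 across classes (one common
# convention; other normalizations are equally valid).
freqs = {
    "car": 0.71, "truck": 0.18, "bus": 0.045,
    "motorcycle": 0.032, "bicycle": 0.030, "emergency_vehicle": 0.003,
}

inv = {c: 1.0 / f for c, f in freqs.items()}
mean_inv = sum(inv.values()) / len(inv)
weights = {c: w / mean_inv for c, w in inv.items()}

for c, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{c:18s} weight = {w:.3f}")
```

At a 0.3% prevalence, emergency_vehicle ends up weighted roughly 237x heavier than car (the ratio of their inverse frequencies), which is why pure reweighting often needs to be combined with oversampling or targeted data collection rather than used alone.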
Your goal is to improve performance on the long-tail class without breaking overall system behavior.
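Because the goal is two-sided (lift recall on the tail class without regressing the head classes), any candidate change should be gated on per-class metrics rather than a single aggregate accuracy, which the 71% car share would dominate. A minimal sketch of per-class recall from prediction/label pairs (class names are from the table; the toy data is purely illustrative):

```python
from collections import Counter

CLASSES = ["car", "truck", "bus", "motorcycle", "bicycle", "emergency_vehicle"]

def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted instances / true instances."""
    correct = Counter()
    total = Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in CLASSES if total[c]}

# Toy example: one of two emergency vehicles is missed.
y_true = ["car", "car", "truck", "emergency_vehicle", "emergency_vehicle"]
y_pred = ["car", "car", "truck", "emergency_vehicle", "truck"]
print(per_class_recall(y_true, y_pred))
# car and truck recall stay at 1.0 while emergency_vehicle drops to 0.5
```

Slicing this same metric by the available metadata (time_of_day, weather) would surface the night/rain failure modes described above.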