RoboFleet operates 2,000 autonomous mobile robots across fulfillment centers. The robotics team wants a reinforcement learning policy that improves navigation efficiency and collision avoidance in simulation before controlled deployment on real robots.
Unlike a supervised learning task, this problem starts from logged interaction trajectories collected from a simulator and from prior robot controllers. You are given offline rollout data for policy initialization and an online training environment for subsequent policy improvement.
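One common way to use the offline rollouts for initialization is behavior cloning: fit a policy to reproduce the logged controller's actions before any online RL fine-tuning. The sketch below uses a synthetic batch and a linear least-squares policy purely for illustration; the state dimensionality, the `policy` function, and the data shapes are assumptions, not part of the brief.

```python
import numpy as np

# Synthetic stand-in for the 12M-timestep offline dataset (shapes are assumed).
rng = np.random.default_rng(0)
states = rng.normal(size=(1024, 12))   # e.g. lidar bins + velocity, goal distance, heading, battery
true_w = rng.normal(size=(12, 2))
# Logged actions: (linear_velocity, angular_velocity) from a prior controller, plus noise.
actions = states @ true_w + 0.01 * rng.normal(size=(1024, 2))

# Behavior cloning reduced to least squares: find W minimizing ||states @ W - actions||^2.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state: np.ndarray) -> np.ndarray:
    """Warm-start policy: maps a state vector to (linear_velocity, angular_velocity)."""
    return state @ W

mse = float(np.mean((states @ W - actions) ** 2))
```

A real warm start would use a neural policy and the full dataset, but the structure is the same: supervised regression from logged states to logged actions, then online improvement from that starting point.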
| Data Component | Size | Contents |
|---|---|---|
| State vectors | 12M timesteps | lidar summary bins, robot velocity, goal distance, heading error, battery level |
| Actions | 12M timesteps | linear velocity, angular velocity |
| Rewards | 12M timesteps | progress-to-goal reward, collision penalty, time penalty, success bonus |
| Episode metadata | 180K episodes | map_id, obstacle_density, payload_weight, floor_type |
| Safety labels | 180K episodes | collision_count, emergency_stop, timeout |
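The table above suggests a natural in-memory layout: per-timestep transitions grouped under per-episode metadata and safety labels. A minimal sketch with `dataclasses`; the class and field names are illustrative and the real schema may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    """One timestep: state, action, and scalar reward (field names assumed)."""
    state: list[float]            # lidar summary bins, velocity, goal distance, heading error, battery
    action: tuple[float, float]   # (linear_velocity, angular_velocity)
    reward: float                 # progress-to-goal - collision penalty - time penalty + success bonus

@dataclass
class Episode:
    """One of the 180K episodes: metadata, safety labels, and its transitions."""
    map_id: str
    obstacle_density: float
    payload_weight: float
    floor_type: str
    collision_count: int
    emergency_stop: bool
    timeout: bool
    transitions: list[Transition] = field(default_factory=list)
```

Grouping safety labels at the episode level, as the table does, makes it easy to filter or reweight unsafe episodes (e.g. those with `collision_count > 0`) before offline training.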
A good solution should improve mean episodic return by at least 20% over the rule-based baseline, reduce the per-episode collision rate below 5%, and keep inference latency under 10 ms per control step on edge hardware.
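These acceptance criteria can be read as a machine-checkable gate on evaluation results. A minimal sketch, assuming the baseline return is positive and latency is summarized as a single per-step number; the names `EvalResult` and `meets_targets` are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    mean_return: float      # mean episodic return over the evaluation set
    collision_rate: float   # fraction of episodes with at least one collision
    latency_ms: float       # inference latency per control step on edge hardware

def meets_targets(candidate: EvalResult, baseline_return: float) -> bool:
    """Check the three criteria: +20% return, <5% collision rate, <10 ms/step."""
    return (
        candidate.mean_return >= 1.20 * baseline_return
        and candidate.collision_rate < 0.05
        and candidate.latency_ms < 10.0
    )
```

Note the 20% threshold is relative, so it only behaves sensibly when the baseline return is positive; a negative-return baseline would need an absolute-improvement target instead. Whether latency is gated on the mean or a tail percentile is also a design choice left open by the brief.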