RoboCourier is training an imitation-learning policy for low-speed warehouse cart navigation across 12 fulfillment centers. A behavior cloning model trained only on expert demonstrations performs well offline but degrades sharply during live rollouts because small mistakes push the cart into states the expert dataset rarely contains.
You are given logged state-action trajectories from a human teleoperator, plus additional states collected during policy rollouts that the expert can re-label. The goal is to implement and evaluate a DAgger-style training loop that reduces compounding errors relative to standard behavior cloning; a minimal sketch of such a loop appears after the feature table below. Each logged record contains the following feature groups:
| Feature Group | Count | Examples |
|---|---|---|
| Sensor state | 96 | lidar bins, depth summaries, obstacle distances |
| Kinematics | 8 | speed, yaw rate, steering angle, acceleration |
| Route context | 12 | waypoint heading error, distance-to-goal, aisle width |
| Action labels | 2 | steering command, throttle command |
| Metadata | 5 | site_id, robot_id, timestamp, episode_id, intervention_flag |
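The sketch below illustrates one way the DAgger loop could be structured, assuming a hypothetical `env` with `reset()`/`step(action)` for closed-loop rollouts and a hypothetical `expert_action(state)` relabeling oracle; neither interface is part of the provided logs, and the sklearn regressor simply stands in for whatever policy class is actually used.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

STATE_DIM = 96 + 8 + 12   # sensor + kinematics + route-context features
ACTION_DIM = 2            # steering command, throttle command

def train_policy(states: np.ndarray, actions: np.ndarray) -> MLPRegressor:
    """Fit a behavior-cloning regressor on the aggregated dataset."""
    policy = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500)
    policy.fit(states, actions)
    return policy

def dagger(expert_states, expert_actions, env, expert_action,
           n_iters=5, rollout_steps=2000, beta0=0.5):
    """DAgger: repeatedly roll out the learned policy, have the expert relabel
    the visited states, aggregate them into the dataset, and retrain."""
    states, actions = expert_states.copy(), expert_actions.copy()
    policy = train_policy(states, actions)           # iteration 0 is plain behavior cloning
    for i in range(1, n_iters + 1):
        beta = beta0 ** i                            # decaying expert-mixing weight
        new_states, new_labels = [], []
        s = env.reset()
        for _ in range(rollout_steps):
            a_expert = expert_action(s)              # expert labels every visited state
            a_policy = policy.predict(s.reshape(1, -1))[0]
            a = a_expert if np.random.rand() < beta else a_policy
            new_states.append(s)
            new_labels.append(a_expert)              # training labels always come from the expert
            s, done = env.step(a)                    # assumed (next_state, done) return
            if done:
                s = env.reset()
        # Dataset aggregation: D <- D ∪ {(s, expert(s))}
        states = np.vstack([states, np.array(new_states)])
        actions = np.vstack([actions, np.array(new_labels)])
        policy = train_policy(states, actions)
    return policy
```

The key difference from behavior cloning is that later iterations train on states the learned policy actually visits, which is exactly where compounding errors originate.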
A strong solution demonstrates that DAgger improves closed-loop performance over behavior cloning by reducing both the intervention rate and the cumulative trajectory deviation. The minimum bar is at least a 25% reduction in interventions per 1,000 meters relative to the behavior cloning baseline, while keeping inference under 20 ms per step.
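A rough sketch of these evaluation metrics is below; `rollout_log` is an assumed list of per-step dicts with keys such as `"intervention_flag"`, `"distance_m"`, and `"deviation_m"`, not a confirmed log schema.

```python
import time
import numpy as np

def interventions_per_1000m(rollout_log):
    # Interventions normalized by distance traveled, in events per 1,000 meters.
    interventions = sum(step["intervention_flag"] for step in rollout_log)
    meters = sum(step["distance_m"] for step in rollout_log)
    return 1000.0 * interventions / max(meters, 1e-6)

def cumulative_deviation_m(rollout_log):
    # Sum of per-step lateral deviation from the reference trajectory, in meters.
    return float(np.sum([step["deviation_m"] for step in rollout_log]))

def mean_inference_ms(policy, states, n=1000):
    # Rough per-step latency estimate to check against the 20 ms budget.
    n = min(n, len(states))
    start = time.perf_counter()
    for s in states[:n]:
        policy.predict(s.reshape(1, -1))
    return 1000.0 * (time.perf_counter() - start) / n
```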