You’re interviewing for an ML engineering role on the autonomy team at WayDrive, a ride-hailing company operating a geo-fenced Level-4 autonomous fleet in Phoenix and Austin. The fleet drives ~2.5 million miles/week and must satisfy strict internal safety gates before any policy update can be deployed. A recent incident review showed that the current rule-based planner is overly conservative during merges and lane changes, causing 7–12% longer ETAs and frequent “handoff-to-remote-assist” events that cost the business ~$18 per intervention.
Your task is to propose and implement a reinforcement learning (RL) approach to improve highway driving behavior (lane keeping, lane changes, merges) while meeting safety and latency constraints.
You are given an offline dataset collected from a mixture of human-driven and autonomy-driven trajectories, plus a lightweight simulator for evaluation.
| Component | Scale / Shape | Notes |
|---|---|---|
| Offline trajectories | 12M episodes, avg 18s each | Logged at 10 Hz (180 steps/episode avg) |
| State (observation) | 128-d float vector | Ego kinematics, lane geometry, lead/follow gaps, relative velocities, map cues |
| Action space | Continuous (3-d) | (steering_rate, throttle, brake) normalized to [-1, 1] |
| Rewards (logged) | scalar per step | Proxy reward: progress, comfort penalties, rule violations |
| Safety labels | per step | Collision, near-collision (TTC<1.0s), hard-brake, lane departure |
| Missingness | ~2% | Intermittent sensor dropouts; some features NaN for 1–3 frames |
| Distribution shift | moderate | Night/rain underrepresented (only 6% of miles) |
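The 1–3 frame NaN dropouts noted in the table suggest a forward-fill with a gap limit, so stale values are never trusted for longer outages. A minimal sketch, assuming per-episode `(T, D)` observation arrays; `MAX_GAP` and the helper name are illustrative, not from the spec:

```python
import numpy as np

MAX_GAP = 3  # dropouts last 1-3 frames per the dataset spec


def impute_dropouts(obs: np.ndarray, max_gap: int = MAX_GAP) -> np.ndarray:
    """Forward-fill NaN features over gaps of at most `max_gap` frames.

    obs: (T, D) array of per-step observations; NaNs mark sensor dropouts.
    Only the first `max_gap` frames of a gap are filled with the last valid
    value; anything beyond that stays NaN so downstream code can mask or
    drop those steps rather than act on stale sensor data.
    """
    out = obs.copy()
    T, D = out.shape
    for d in range(D):
        last = np.nan  # most recent valid value in this feature column
        gap = 0        # length of the current NaN run
        for t in range(T):
            if np.isnan(out[t, d]):
                gap += 1
                if gap <= max_gap and not np.isnan(last):
                    out[t, d] = last
            else:
                last = out[t, d]
                gap = 0
    return out
```

Keeping long gaps as NaN (instead of filling them) preserves the option to exclude those transitions from training entirely, which matters when only ~2% of steps are affected.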
Target behavior: highway autonomy at 25–75 mph, including lane changes and merges.
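The logged proxy reward is described only as "progress, comfort penalties, rule violations"; one plausible per-step decomposition, tying the near-collision penalty to the TTC < 1.0 s safety label, might look like the sketch below. All weights and field names here are hypothetical, not taken from the logs:

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    progress_m: float      # forward progress this 0.1 s step (meters)
    jerk: float            # longitudinal jerk (m/s^3), comfort proxy
    lat_accel: float       # lateral acceleration (m/s^2), comfort proxy
    ttc_s: float           # time-to-collision to lead vehicle (seconds)
    lane_departure: bool   # rule-violation flag, as in the safety labels


# Hypothetical weights -- the actual weighting of the logged reward is not given.
W_PROGRESS, W_JERK, W_LAT, W_NEAR, W_VIOLATION = 1.0, 0.05, 0.02, 2.0, 5.0


def proxy_reward(s: StepInfo) -> float:
    """Progress minus comfort and rule-violation penalties, per the table's
    'progress, comfort penalties, rule violations' decomposition."""
    r = W_PROGRESS * s.progress_m
    r -= W_JERK * abs(s.jerk) + W_LAT * abs(s.lat_accel)
    if s.ttc_s < 1.0:  # near-collision threshold from the safety labels
        r -= W_NEAR
    if s.lane_departure:
        r -= W_VIOLATION
    return r
```

Separating the safety-label penalties from the comfort terms keeps the proxy auditable: a candidate policy's reward gains can be attributed to progress rather than to trading away safety margin.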
You must propose an RL solution that, in simulator evaluation on a held-out scenario suite: