You’re interviewing for an ML engineering role on the autonomy team at WayDrive, a ride-hailing company operating a geo-fenced Level-4 autonomous fleet in Phoenix and Austin. The fleet drives ~2.5 million miles/week and must satisfy strict internal safety gates before any policy update can be deployed. A recent incident review showed that the current rule-based planner is overly conservative during merges and lane changes, causing 7–12% longer ETAs and frequent “handoff-to-remote-assist” events that cost the business ~$18 per intervention.
Your task is to propose and implement a reinforcement learning (RL) approach to improve highway driving behavior (lane keeping, lane changes, merges) while meeting safety and latency constraints.
You are given an offline dataset collected from a mixture of human-driven and autonomy-driven trajectories, plus a lightweight simulator for evaluation.
| Component | Scale / Shape | Notes |
|---|---|---|
| Offline trajectories | 12M episodes, avg 18s each | Logged at 10 Hz (180 steps/episode avg) |
| State (observation) | 128-d float vector | Ego kinematics, lane geometry, lead/follow gaps, relative velocities, map cues |
| Action space | Continuous (3-d) | (steering_rate, throttle, brake) normalized to [-1, 1] |
| Rewards (logged) | scalar per step | Proxy reward: progress, comfort penalties, rule violations |
| Safety labels | per step | Collision, near-collision (TTC<1.0s), hard-brake, lane departure |
| Missingness | ~2% | Intermittent sensor dropouts; some features NaN for 1–3 frames |
| Distribution shift | moderate | Night/rain underrepresented (only 6% of miles) |
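The 1–3 frame NaN dropouts noted in the table suggest a forward-fill with a gap limit, so stale values are never trusted for longer outages. A minimal sketch, assuming per-episode `(T, D)` observation arrays; `MAX_GAP` and the helper name are illustrative, not from the spec:

```python
import numpy as np

MAX_GAP = 3  # dropouts last 1-3 frames per the dataset spec


def impute_dropouts(obs: np.ndarray, max_gap: int = MAX_GAP) -> np.ndarray:
    """Forward-fill NaN features over gaps of at most `max_gap` frames.

    obs: (T, D) array of per-step observations; NaNs mark sensor dropouts.
    Only the first `max_gap` frames of a gap are filled with the last valid
    value; anything beyond that stays NaN so downstream code can mask or
    drop those steps rather than act on stale sensor data.
    """
    out = obs.copy()
    T, D = out.shape
    for d in range(D):
        last = np.nan  # most recent valid value in this feature column
        gap = 0        # length of the current NaN run
        for t in range(T):
            if np.isnan(out[t, d]):
                gap += 1
                if gap <= max_gap and not np.isnan(last):
                    out[t, d] = last
            else:
                last = out[t, d]
                gap = 0
    return out
```

Keeping long gaps as NaN (instead of filling them) preserves the option to exclude those transitions from training entirely, which matters when only ~2% of steps are affected.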
Target behavior: highway autonomy at 25–75 mph, including lane changes and merges.
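The logged proxy reward is described only as "progress, comfort penalties, rule violations"; one plausible per-step decomposition, tying the near-collision penalty to the TTC < 1.0 s safety label, might look like the sketch below. All weights and field names here are hypothetical, not taken from the logs:

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    progress_m: float      # forward progress this 0.1 s step (meters)
    jerk: float            # longitudinal jerk (m/s^3), comfort proxy
    lat_accel: float       # lateral acceleration (m/s^2), comfort proxy
    ttc_s: float           # time-to-collision to lead vehicle (seconds)
    lane_departure: bool   # rule-violation flag, as in the safety labels


# Hypothetical weights -- the actual weighting of the logged reward is not given.
W_PROGRESS, W_JERK, W_LAT, W_NEAR, W_VIOLATION = 1.0, 0.05, 0.02, 2.0, 5.0


def proxy_reward(s: StepInfo) -> float:
    """Progress minus comfort and rule-violation penalties, per the table's
    'progress, comfort penalties, rule violations' decomposition."""
    r = W_PROGRESS * s.progress_m
    r -= W_JERK * abs(s.jerk) + W_LAT * abs(s.lat_accel)
    if s.ttc_s < 1.0:  # near-collision threshold from the safety labels
        r -= W_NEAR
    if s.lane_departure:
        r -= W_VIOLATION
    return r
```

Separating the safety-label penalties from the comfort terms keeps the proxy auditable: a candidate policy's reward gains can be attributed to progress rather than to trading away safety margin.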
You must propose an RL solution that, in simulator evaluation on a held-out scenario suite: