Context
WayDrive operates a Level-4 autonomous ride-hailing fleet in Phoenix and Austin with 1,200 vehicles and ~180k rider trips/day. The autonomy stack is updated weekly. Before any release can be promoted to an on-road “shadow” phase (running in parallel without controlling the car) and then to limited on-road control, the team relies heavily on a large-scale simulator: ~40M simulated miles/week across curated scenario libraries (unprotected left turns, cut-ins, construction zones, emergency vehicles, etc.).
The simulator produces a set of offline metrics used as gatekeepers for promotion. However, leadership has noticed a recurring issue: releases that look strong in simulation sometimes underperform on-road, and occasionally the reverse. This is a safety-critical domain: a bad promotion decision can increase hard braking events, safety driver interventions, and in the worst case, collisions. At the same time, being overly conservative slows iteration and delays expansion to new geofences.
Current Release Data (Last 8 Weekly Builds)
For each weekly build, you have aggregate simulator metrics and metrics from the subsequent 7-day on-road shadow evaluation (same routes, same time windows, and matched weather where possible). “On-road” metrics are computed from safety-driver logs and vehicle telemetry.
Simulator metrics (offline):
- Collision rate per 1,000 sim miles (lower is better)
- Mean jerk (m/s³) (lower is better; proxy for comfort/control)
- Route completion rate (higher is better)
- Planner constraint violations per 100 miles (lower is better)
On-road metrics (real-world):
- Disengagements per 1,000 miles (lower is better)
- Safety-critical interventions per 10,000 miles (lower is better)
- Hard braking events per 100 miles (lower is better)
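All of these are event counts over exposure miles, so every point estimate should be reported with an interval, not as a bare rate. A minimal sketch of an exact (Garwood) Poisson interval; the event count and mileage below are hypothetical, chosen only to reproduce a rate of 0.92 per 1,000 miles:

```python
from scipy.stats import chi2

def poisson_rate_ci(events: int, miles: float, per: float = 1_000.0, alpha: float = 0.05):
    """Exact (Garwood) confidence interval for an event rate per `per` miles."""
    lo = 0.0 if events == 0 else chi2.ppf(alpha / 2, 2 * events) / 2
    hi = chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / 2
    scale = per / miles
    return events * scale, lo * scale, hi * scale

# Hypothetical: 46 disengagements observed over 50,000 shadow miles
rate, lo, hi = poisson_rate_ci(46, 50_000)
```

With counts this small, the interval is wide enough to matter for any promotion decision; rarer events (collisions) are wider still.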
Performance Snapshot (Aggregated Across Builds)
| Metric | Simulator (Offline) | On-road (7-day Shadow) |
|---|---|---|
| Collision rate | 0.18 / 1,000 mi | 0.06 / 1,000 mi (proxy: safety-critical interventions) |
| Mean jerk | 1.42 m/s³ | 1.10 m/s³ (telemetry-derived) |
| Route completion | 96.8% | 94.9% (mission completion) |
| Constraint violations | 2.6 / 100 mi | 1.9 / 100 mi (planner alerts) |
| Disengagements | — | 0.92 / 1,000 mi |
You also computed correlations across the 8 builds between each simulator metric and on-road disengagement rate:
| Simulator Metric vs On-road Disengagements | Pearson r |
|---|---|
| Sim collision rate | 0.31 |
| Sim mean jerk | 0.12 |
| Sim route completion | -0.28 |
| Sim constraint violations | 0.55 |
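With only 8 builds, each r above is a very noisy estimate, and any plan should report resampling-based intervals rather than point values. A sketch of a bootstrap interval for Pearson r; the per-build arrays are illustrative stand-ins, not the real release data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical per-build values for the 8 weekly builds (illustrative only)
sim_violations = np.array([2.1, 2.9, 2.4, 3.1, 2.2, 2.8, 2.6, 2.7])
disengagements = np.array([0.95, 0.88, 0.80, 1.10, 0.78, 1.05, 0.85, 0.98])

def bootstrap_r_ci(x, y, n_boot=10_000):
    """95% percentile bootstrap interval for Pearson r (resample builds)."""
    rs = []
    n = len(x)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        # skip degenerate resamples where a variable is constant
        if np.std(x[idx]) == 0 or np.std(y[idx]) == 0:
            continue
        rs.append(pearsonr(x[idx], y[idx])[0])
    return np.percentile(rs, [2.5, 97.5])

lo_r, hi_r = bootstrap_r_ci(sim_violations, disengagements)
```

Expect the interval to be wide at n=8; if it spans zero, the metric cannot yet justify a gate on its own.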
The Problem
The VP of Autonomy asks you to validate whether simulation metrics actually predict real-world road performance well enough to be used as promotion gates. They want a rigorous plan that addresses:
- Small sample sizes (only a handful of builds per month)
- Distribution shift (sim scenarios are curated; road is messy)
- Metric gaming (teams optimizing sim metrics without improving real safety)
- Safety risk (false confidence is unacceptable)
Your Task (What you must deliver)
- Define what “correlate” should mean here: which on-road outcomes are the true north (disengagements, interventions, collisions, comfort), and which simulator metrics should be treated as leading indicators vs weak proxies.
- Propose a statistical validation design to quantify predictive relationship(s) between simulator metrics and on-road outcomes, including:
- correlation vs regression vs rank-based agreement
- how you would handle non-linearity, heteroskedasticity, and outliers
- confidence intervals and uncertainty reporting given only 8 builds
- Describe how you would test for calibration: whether a given simulator risk score maps to an expected on-road risk level, not just relative ordering.
- Recommend a gating policy that uses simulation + limited on-road shadow data to minimize safety risk (e.g., conservative thresholds, two-stage gates, segment-specific gates).
- Explain how you would detect and mitigate sim-to-real gaps (scenario coverage gaps, sensor realism, behavior model mismatch), and how you’d update the simulator metric suite.
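For the calibration question in particular, a leave-one-out fit is about the most honest check available at n=8: predict each build's on-road rate from the other seven and ask whether the absolute errors are small relative to the promotion threshold. A minimal sketch, again with hypothetical per-build values:

```python
import numpy as np

# Hypothetical per-build data for 8 builds (replace with real release metrics)
sim_score = np.array([2.1, 2.9, 2.4, 3.1, 2.2, 2.8, 2.6, 2.7])        # violations / 100 mi
road_rate = np.array([0.95, 0.88, 0.80, 1.10, 0.78, 1.05, 0.85, 0.98])  # diseng. / 1k mi

def loo_calibration(x, y):
    """Leave-one-out linear calibration: predict each build from the other seven."""
    preds = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        preds.append(slope * x[i] + intercept)
    preds = np.array(preds)
    mae = np.mean(np.abs(preds - y))
    return preds, mae

preds, mae = loo_calibration(sim_score, road_rate)
```

A rank-based check (does the sim score order builds the same way the road does?) answers ordering; this check answers the stronger calibration claim that a sim score maps to an expected on-road level.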
Constraints
- On-road controlled testing is expensive and limited: only 50k controlled miles/week are available for pre-production evaluation.
- Safety team requires that any new build must not increase safety-critical interventions by more than +5% relative to the current production build with 95% confidence.
- The fleet operates in multiple ODD segments (downtown, suburban arterials, highways). A single global metric may hide segment regressions.
- Labels for true collisions are rare; interventions and near-misses are used as proxies but are noisy and subject to reporting bias.
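The safety team's +5% requirement can be operationalized as a one-sided non-inferiority test on the intervention rate ratio. Conditional on the total event count, the candidate's share of events is binomial, which gives an exact p-value with no normal approximation; useful for rare-event counts. A sketch (event counts and mileages below are hypothetical; `margin=1.05` encodes the +5% bound):

```python
from scipy.stats import binomtest

def noninferiority_p(cand_events, cand_miles, prod_events, prod_miles, margin=1.05):
    """One-sided exact test of H0: rate_cand / rate_prod >= margin.
    Conditional on total events, the candidate's count is binomial with
    success probability margin*m_c / (margin*m_c + m_p) under H0."""
    n = cand_events + prod_events
    p0 = (margin * cand_miles) / (margin * cand_miles + prod_miles)
    return binomtest(cand_events, n, p0, alternative="less").pvalue

# Hypothetical: 18 interventions in 50k candidate shadow miles
# vs 160 interventions in 400k production miles
p = noninferiority_p(18, 50_000, 160, 400_000)
promote = p < 0.05
```

Note how the 50k-mile weekly budget binds: with intervention rates this low, a single week of shadow miles often cannot reject H0 even when the candidate is genuinely no worse, which motivates the two-stage or segment-specific gates asked for above.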