You’re the analytics lead embedded with Waymo’s Perception team, specifically the pedestrian detection squad powering an L4 autonomous ride-hailing service operating in Phoenix and San Francisco. The fleet drives ~1.8M autonomous miles/week and completes ~220k rider trips/week. Safety regulators and city partners require transparent reporting, and internally the company has a quarterly goal to reduce safety-critical events without degrading rider experience (e.g., excessive hard braking).
Over the last month, the team shipped two changes: (1) a new camera model with different noise characteristics and (2) a model update that improved offline mAP on a benchmark dataset. However, the on-road safety review board is concerned: “near-miss” events in the closed-loop simulator increased 9% week-over-week, while on-road disengagements stayed flat. Engineering argues the model is “better” per offline metrics; Safety argues that real-world risk may be rising.
Your PM asks you to define a single primary metric (a North Star KPI) for the pedestrian detection team that can be used in weekly business reviews and release gates. The metric must: (a) align with real-world safety risk, (b) be measurable with available data at scale, (c) be robust to changes in fleet mix and geography, and (d) be hard to game by simply changing thresholds.
| Source | What it contains | Granularity |
|---|---|---|
| perception_detections | model version, timestamp, object_id, class=pedestrian, confidence score, 3D box, track continuity | per frame / per track |
| sensor_fusion_groundtruth | human-labeled clips for sampled miles; pedestrian presence, location, occlusion, distance | per clip |
| planner_events | TTC (time-to-collision), hard brake, yield/stop decisions, predicted trajectories | per event |
| autonomy_disengagements | disengagement reason codes, speed, location, preceding objects | per disengagement |
| fleet_miles | miles driven by city, time of day, weather proxy, road type | per trip / per segment |
| sim_closed_loop_results | scenario id, near-miss flags, collision flags, TTC distributions | per simulation run |
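
Requirement (c), robustness to fleet mix and geography, can be illustrated with direct standardization: compute an event rate per mile within each stratum, then weight the strata by a fixed reference mix rather than by the current week's miles. The sketch below is one possible approach and not a prescribed answer. It assumes simplified rows derived from `fleet_miles` and `planner_events`; the stratum keys, the 1.5 s TTC cutoff, the reference weights, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fixed reference mix: share of miles per (city, road_type) stratum.
REFERENCE_MIX = {("PHX", "arterial"): 0.5, ("SF", "arterial"): 0.5}
TTC_THRESHOLD_S = 1.5  # hypothetical cutoff for a "low-TTC" pedestrian event


def crude_low_ttc_rate(miles_rows, event_rows,
                       ttc_threshold_s=TTC_THRESHOLD_S, per_miles=1000.0):
    """Unadjusted low-TTC events per `per_miles` miles (sensitive to fleet mix)."""
    total_miles = sum(segment_miles for _, segment_miles in miles_rows)
    total_events = sum(1 for _, ttc_s in event_rows if ttc_s < ttc_threshold_s)
    return per_miles * total_events / total_miles


def standardized_low_ttc_rate(miles_rows, event_rows, reference_mix=REFERENCE_MIX,
                              ttc_threshold_s=TTC_THRESHOLD_S, per_miles=1000.0):
    """Low-TTC events per `per_miles` miles, weighted by a fixed reference mix.

    miles_rows: iterable of (stratum, miles), e.g. aggregated from fleet_miles.
    event_rows: iterable of (stratum, ttc_seconds), e.g. from planner_events.
    """
    miles = defaultdict(float)
    for stratum, segment_miles in miles_rows:
        miles[stratum] += segment_miles

    events = defaultdict(int)
    for stratum, ttc_s in event_rows:
        if ttc_s < ttc_threshold_s:
            events[stratum] += 1

    weighted_rate = 0.0
    covered_weight = 0.0
    for stratum, weight in reference_mix.items():
        if miles[stratum] > 0:  # skip strata with no exposure this period
            weighted_rate += weight * events[stratum] / miles[stratum]
            covered_weight += weight
    if covered_weight == 0:
        raise ValueError("no exposure in any reference stratum")
    # Renormalize so missing strata do not deflate the rate.
    return per_miles * weighted_rate / covered_weight


# Toy data: per-stratum rates are identical in both weeks; only the mix shifts.
PHX, SF = ("PHX", "arterial"), ("SF", "arterial")
week1_miles = [(PHX, 1000.0), (SF, 1000.0)]
week1_events = [(PHX, 1.0)] * 2 + [(SF, 1.0)] * 6 + [(SF, 3.0)]  # last is not low-TTC
week2_miles = [(PHX, 3000.0), (SF, 1000.0)]
week2_events = [(PHX, 1.0)] * 6 + [(SF, 1.0)] * 6

for label, m, e in [("week 1", week1_miles, week1_events),
                    ("week 2", week2_miles, week2_events)]:
    print(label, crude_low_ttc_rate(m, e), standardized_low_ttc_rate(m, e))
```

In this toy data the crude rate drops from about 4 to about 3 events per 1,000 miles purely because more miles shift to the lower-rate stratum, while the standardized rate stays at about 4. Counting planner-side TTC events rather than detections above a confidence score is also one way to make the number less sensitive to threshold changes, though that is a design assumption rather than something the scenario specifies.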
Constraints: