Business Context
You’re on the Perception & Prediction team at MetroDrive, a robotaxi service operating in San Francisco and Phoenix. The fleet drives 2.5M autonomous miles/week. A disproportionate share of safety-critical disengagements and hard-braking events come from pedestrian interactions near crosswalks and curb edges. Product and Safety want a model that can anticipate pedestrian behavior 1–3 seconds ahead so the planner can slow early, reduce jerk, and avoid last-moment emergency braking.
The model will run on-vehicle (embedded GPU/CPU) and must be robust to occlusions, dense urban scenes, and distribution shift (new intersections, events, weather). False negatives (predicting “won’t cross” for a pedestrian who then crosses) are safety-critical; false positives cause unnecessary yielding and lengthen ride time.
Dataset
You have logged data from the perception stack (already fused and tracked). Each example corresponds to a pedestrian track segment aligned to the ego vehicle timeline.
| Component | Scale / Shape | Examples / Notes |
|---|---|---|
| Track sequences | 1.8M sequences | 20 Hz, variable length 1–6s (padded/truncated) |
| Kinematics (per timestep) | 12 floats | x/y in ego frame, vx/vy, ax/ay, heading, yaw_rate |
| Scene context (static) | 18 floats | distance to nearest crosswalk, curb, sidewalk; lane count; speed limit |
| Interaction features | 10 floats | TTC to ego path, relative bearing, ego speed/accel, gap to nearest vehicle |
| Semantics | 6 categorical | intersection type, crosswalk present, signal state (if available), time-of-day bucket |
| Labels | binary + time-to-event | whether pedestrian enters ego lane/crosswalk within horizon; time-to-cross (if positive) |
Additional characteristics:
- Positive rate: ~7% of sequences result in a crossing within 3s (highly imbalanced).
- Missingness: ~12% of timesteps have partial occlusion (noisy velocity/accel); signal state missing in ~30% of scenes.
- Leakage risk: tracks may include frames after the pedestrian has already stepped off the curb if you’re not careful with label alignment.
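The leakage risk above can be made concrete with a small label-alignment sketch. It assumes each track carries a list of frame timestamps and an optional crossing time; the function name `make_example`, the argument names, and the rule for dropping censored tracks are hypothetical and not part of the logged data schema.

```python
from typing import List, Optional, Tuple

Example = Tuple[List[int], int, Optional[float]]


def make_example(
    timestamps: List[float],
    t_cross: Optional[float],
    t_obs: float,
    horizon_s: float = 3.0,
) -> Optional[Example]:
    """Build one (input_frame_indices, label, time_to_cross) example.

    timestamps: frame times of the track in seconds, ascending.
    t_cross:    time the pedestrian enters the ego lane/crosswalk, or None.
    t_obs:      observation cutoff; the model may only see frames <= t_obs.
    Returns None when the example should be dropped.
    """
    # The pedestrian has already stepped off the curb at the cutoff:
    # the outcome is visible in the inputs, so this would be leakage.
    if t_cross is not None and t_cross <= t_obs:
        return None

    # Inputs are restricted to frames at or before the cutoff.
    idx = [i for i, t in enumerate(timestamps) if t <= t_obs]
    if not idx:
        return None

    # Assumption: if the track ends before the horizon closes and no
    # crossing was seen, the negative label is unknown, so drop it.
    if t_cross is None and timestamps[-1] < t_obs + horizon_s:
        return None

    if t_cross is not None and t_cross <= t_obs + horizon_s:
        return idx, 1, t_cross - t_obs
    return idx, 0, None


# Usage: a 6 s track at 20 Hz, observed up to t = 2.0 s.
track = [i / 20 for i in range(121)]
example = make_example(track, t_cross=4.0, t_obs=2.0)
```

Calling the function with several horizons (1s, 2s, 3s) on the same cutoff would give one label per horizon without changing the input window.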
Success Criteria
- Safety-first recall: At precision ≥ 0.35, achieve recall ≥ 0.85 for “cross within 3s”.
- Calibrated probabilities: Expected Calibration Error (ECE) ≤ 0.03 so the planner can use probabilities as costs.
- Timeliness: Median time-to-detection ≥ 1.0s, where time-to-detection is the lead time between the first frame at which the model’s probability crosses the alert threshold and the moment of crossing.
- On-vehicle performance: p95 inference latency ≤ 20 ms per tracked pedestrian on target hardware (e.g., NVIDIA Orin), with up to 40 concurrent tracks.
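The first three criteria can be computed along the following lines. This is a minimal sketch in plain Python, assuming equal-width bins for ECE and assuming a single above-threshold frame counts as a detection; the function names are hypothetical, and a production evaluation would more likely rely on a tested metrics library.

```python
from typing import List, Optional, Sequence


def recall_at_precision(
    y_true: Sequence[int], scores: Sequence[float], min_precision: float
) -> float:
    """Highest recall over all thresholds whose precision >= min_precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Only evaluate once all examples tied at this score are counted.
        is_last = rank == len(order) - 1
        if not is_last and scores[order[rank + 1]] == scores[i]:
            continue
        if tp / (tp + fp) >= min_precision:
            best = max(best, tp / n_pos)
    return best


def expected_calibration_error(
    y_true: Sequence[int], probs: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width bins: sum_b (n_b / N) * |acc_b - conf_b|."""
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for y, p in zip(y_true, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sum_p[b] += p
        sum_y[b] += y
        count[b] += 1
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            ece += (count[b] / n) * abs(sum_y[b] / count[b] - sum_p[b] / count[b])
    return ece


def lead_time(
    times: Sequence[float], probs: Sequence[float], threshold: float, t_cross: float
) -> Optional[float]:
    """Seconds between the first alert frame and the crossing, or None."""
    for t, p in zip(times, probs):
        if t >= t_cross:
            break
        if p >= threshold:
            return t_cross - t
    return None
```

The median time-to-detection would then be `statistics.median` over the non-`None` lead times of positive tracks; how to count tracks that never trigger an alert is a reporting choice that should be stated alongside the number.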
Constraints
- Real-time: streaming inference at 20 Hz; must support batching across tracks.
- Interpretability: safety review requires feature/attribution summaries (e.g., “distance-to-curb decreasing + heading toward crosswalk”).
- Robustness: handle occlusion and sensor noise; avoid brittle dependence on a single feature like signal state.
- Evaluation: must avoid temporal leakage; split by geography + time (hold out intersections and weeks).
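One way to read the geography + time requirement is a two-way holdout: a sequence is used for testing only if both its intersection and its week are held out, and sequences in the mixed cells are dropped from the primary split. The sketch below assumes each sequence has an intersection ID and an integer week index; `assign_split`, `first_test_week`, and the hash-based bucketing are hypothetical choices, not a prescribed scheme.

```python
import hashlib


def is_holdout_intersection(intersection_id: str, holdout_frac: float = 0.2) -> bool:
    """Deterministically hold out a fraction of intersections by hashing the ID."""
    digest = hashlib.sha256(intersection_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < holdout_frac * 10_000


def assign_split(
    intersection_id: str,
    week: int,
    first_test_week: int,
    holdout_frac: float = 0.2,
) -> str:
    """Return 'train', 'test', or 'discard' for one sequence."""
    geo_holdout = is_holdout_intersection(intersection_id, holdout_frac)
    time_holdout = week >= first_test_week
    if geo_holdout and time_holdout:
        return "test"  # unseen intersection AND unseen week
    if not geo_holdout and not time_holdout:
        return "train"
    # Mixed cells share either location or time with the other split.
    return "discard"


# Usage with a made-up intersection ID.
split = assign_split("sf_0001", week=10, first_test_week=20)
```

Hashing the ID keeps the assignment stable as new data arrives, so an intersection never drifts between splits. A validation set can be carved out of the training cell with the same function and an earlier week cutoff, and the discarded cells remain useful as secondary evaluations (seen place in a new week, or new place in a seen week).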
Deliverables
- Define the prediction target(s): classification horizon(s) (1s/2s/3s) and optional time-to-cross regression.
- Propose a model architecture and training objective that handles imbalance and produces calibrated probabilities.
- Specify feature engineering for sequences and context, including handling missing/occlusion.
- Describe the train/val/test split and cross-validation strategy to prevent leakage.
- Provide an evaluation plan with metrics aligned to safety and planning (PR-AUC, recall@precision, calibration, time-to-detection).
- Outline a production deployment plan: streaming inference, monitoring, retraining cadence, and rollback triggers.
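For the imbalance part of the training objective, one common option (among others such as focal loss or resampling) is a class-weighted binary cross-entropy. The sketch below is an illustration only and does not represent a chosen design; the function names are hypothetical, and the weight is derived from the ~7% positive rate stated in the dataset notes.

```python
import math
from typing import Sequence


def pos_weight_from_rate(positive_rate: float) -> float:
    """Weight that balances total positive and negative loss mass."""
    return (1.0 - positive_rate) / positive_rate


def weighted_bce(
    y_true: Sequence[int],
    probs: Sequence[float],
    pos_weight: float,
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy with positives scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Usage: roughly 13.3 for a 7% positive rate.
w = pos_weight_from_rate(0.07)
```

Up-weighting positives shifts the predicted probabilities upward relative to the true crossing rate, so a model trained this way would need a separate calibration step on held-out data before its outputs are checked against the ECE target.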