Context
Waymo’s Mapping & Localization org ingests LiDAR point clouds from the entire autonomous vehicle fleet to power multiple downstream consumers: (1) offline HD map regeneration, (2) perception model training data selection, (3) incident investigation (“show me all drives near this intersection last Tuesday”), and (4) long-term fleet health analytics (sensor drift, calibration issues). Today, vehicles upload “drive bundles” to cloud storage when they return to depot Wi‑Fi. A nightly batch pipeline parses bundles and writes derived artifacts to a data lake. This architecture has two major problems: freshness (data does not become searchable until 12–36 hours after upload) and operational fragility (retries and partial uploads create duplicates, missing segments, and inconsistent indexes).
You are asked to design a new system to ingest and index LiDAR data from the entire Waymo fleet such that it is queryable quickly and reliably. The system must support both near-real-time indexing (for safety investigations) and cost-efficient batch backfills (for reprocessing after algorithm updates). The platform is Google Cloud–based and must meet strict privacy and security requirements because LiDAR-derived products can be linked to precise geolocation.
Scale Requirements
- Fleet size: 25,000 vehicles globally
- Sensors: 1 roof LiDAR per vehicle, ~20 Hz frames
- Raw LiDAR payload: ~3–6 MB/frame compressed (varies by environment)
- Peak ingestion: 3,000 vehicles uploading concurrently at depot shift changes
- Daily volume: ~2–5 PB/day raw across the fleet (including other sensors, but LiDAR is the dominant contributor)
- Indexing latency SLO: 95% of frames searchable within 15 minutes of cloud arrival; 99% within 60 minutes
- Retention:
  - Raw immutable LiDAR: 180 days (cold storage after 30 days)
  - Derived “index + metadata”: 2+ years
  - Aggregates (tile summaries, drive catalogs): indefinite
- Query patterns (see the sketch after this list):
  - “All frames within polygon + time range”
  - “All drives that traversed road segment X”
  - “Sample N frames with rain + night + speed > 30 mph”
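To make the first query pattern concrete, here is a minimal sketch of how “polygon + time range” could be translated into bounded scans over an index keyed by spatial cell and time bucket. The `<h3_cell>#<ts_bucket>` key layout and hourly bucket width are assumptions, not a prescribed design; in practice a library such as h3-py would compute the covering cells for the polygon.

```python
# Sketch: translate "polygon + time range" into bounded index scan ranges.
# Assumes (invented here) index rows keyed as "<h3_cell>#<ts_bucket>".
from datetime import datetime, timedelta, timezone

def ts_buckets(start: datetime, end: datetime, width=timedelta(hours=1)):
    """Yield hourly bucket labels (e.g. '2024061810') covering [start, end)."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        yield t.strftime("%Y%m%d%H")
        t += width

def scan_prefixes(covering_cells: set[str], start: datetime, end: datetime):
    """Cross product of covering cells and time buckets -> row-key prefixes."""
    return sorted(f"{cell}#{bucket}"
                  for cell in covering_cells
                  for bucket in ts_buckets(start, end))

# Example: 2 covering cells x a 3-hour window -> 6 bounded prefix scans,
# instead of a full-table scan over petabytes of frame metadata.
cells = {"8928308280fffff", "8928308280bffff"}  # example resolution-9 H3 cells
start = datetime(2024, 6, 18, 10, tzinfo=timezone.utc)
print(scan_prefixes(cells, start, start + timedelta(hours=3)))
```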
Data Characteristics
LiDAR arrives as drive sessions containing many segments (e.g., 1–5 minute chunks). Uploads are often partial (a vehicle loses connectivity mid-transfer), can arrive out of order, and may be re-uploaded. Some frames are corrupted or missing timestamps.
Canonical metadata schema (logical)
| Entity | Key fields | Notes |
|---|---|---|
| Drive | drive_id, vehicle_id, start_ts, end_ts, region | drive_id is globally unique (UUID) |
| Segment | segment_id, drive_id, segment_start_ts, segment_end_ts, uri | segment_id unique within drive |
| Frame | frame_id, segment_id, frame_ts, pose, bytes, checksum | frame_ts used for time indexing |
| Index record | frame_id, geohash/H3, ts_bucket, uri, quality_flags | used for search + filtering |
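For concreteness, the logical schema above could be written down as Python dataclasses. This is a sketch: the types, timestamp units (epoch microseconds), and pose layout are assumptions, not part of the canonical spec.

```python
# Sketch of the logical schema; field names follow the table above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Drive:
    drive_id: str          # globally unique (UUID)
    vehicle_id: str
    start_ts: int          # epoch micros (assumed unit)
    end_ts: int
    region: str

@dataclass(frozen=True)
class Segment:
    segment_id: str        # unique within a drive
    drive_id: str
    segment_start_ts: int
    segment_end_ts: int
    uri: str               # location of the raw blob in GCS

@dataclass(frozen=True)
class Frame:
    frame_id: str
    segment_id: str
    frame_ts: int          # used for time indexing
    pose: tuple            # e.g. (x, y, z, qw, qx, qy, qz) -- assumed layout
    bytes: int             # payload size
    checksum: str          # e.g. CRC32C of the payload

@dataclass(frozen=True)
class IndexRecord:
    frame_id: str
    h3_cell: str           # geohash/H3 spatial key
    ts_bucket: str         # coarse time bucket for pruning
    uri: str
    quality_flags: int = 0
```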
Quality issues to handle: duplicates (same frame uploaded twice), late-arriving segments (hours/days), schema evolution (new fields), and “poison” files that crash parsers.
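As one illustration of the last two issues, a defensive parse wrapper can quarantine poison files and drop timestamp-less frames instead of failing an entire batch. The decoder and dead-letter sink below are hypothetical callables, not an existing API:

```python
# Sketch: never let a single "poison" file crash a worker or fail a batch.
import logging
from typing import Callable, Iterable

log = logging.getLogger("ingest")

def parse_with_quarantine(
    blob_uri: str,
    raw: bytes,
    decode: Callable[[bytes], Iterable],   # your segment decoder (hypothetical)
    dead_letter: Callable[[str], None],    # e.g. copy blob to a quarantine bucket
) -> list:
    """Decode one segment blob; quarantine on failure, drop bad frames."""
    try:
        frames = list(decode(raw))
    except Exception:
        log.exception("poison file, quarantining: %s", blob_uri)
        dead_letter(blob_uri)              # keep the evidence for offline triage
        return []
    # Drop frames with missing timestamps rather than failing the whole segment.
    good = [f for f in frames if getattr(f, "frame_ts", None) is not None]
    dropped = len(frames) - len(good)
    if dropped:
        log.warning("%s: dropped %d frame(s) without timestamps", blob_uri, dropped)
    return good
```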
Requirements
Functional
- Ingest LiDAR drive bundles from vehicles into cloud storage with integrity checks (checksums) and provenance (vehicle firmware version, sensor calibration version).
- Parse and normalize raw bundles into a canonical frame/segment representation (columnar metadata + binary blobs).
- Index frames for fast retrieval by (geo, time) and common filters (weather, road type, speed, sensor health flags).
- Support late-arriving data and re-uploads without double-counting (idempotent processing; see the key-derivation sketch after this list).
- Provide a reprocessing/backfill mechanism when parsing/indexing logic changes (e.g., new pose correction), without breaking existing consumers.
- Expose a serving interface for investigators and ML data selection jobs to find and fetch frames quickly.
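As an illustration of the idempotency requirement above, one possible key derivation (an assumption, not the required design) hashes only the content-identifying fields, so a byte-identical re-upload maps to the same key and deduplicates cleanly:

```python
# Sketch: stable idempotency key for frame-level dedupe.
import hashlib

def frame_idempotency_key(drive_id: str, segment_id: str,
                          frame_ts: int, checksum: str) -> str:
    """Identical re-uploads collide on the same key; changed payloads do not."""
    material = f"{drive_id}|{segment_id}|{frame_ts}|{checksum}".encode()
    return hashlib.sha256(material).hexdigest()

# Re-uploading the same frame yields the same key ...
k1 = frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:9a8b")
k2 = frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:9a8b")
assert k1 == k2
# ... while a corrupted/changed payload (different checksum) yields a new key.
assert k1 != frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:0000")
```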
Non-functional
- Exactly-once effects for index writes (or clearly defined at-least-once delivery with dedupe guarantees; see the upsert sketch after this list).
- Security & compliance: encryption at rest/in transit, IAM least privilege, audit logs; support deletion of specific drives if required by policy.
- Cost controls: separate hot vs cold tiers; avoid scanning PBs for common queries.
- Observability: end-to-end latency, completeness, and data quality metrics with actionable alerts.
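One common way to realize “at-least-once with dedupe guarantees” on GCP is to stage index rows and apply them with a BigQuery MERGE keyed on `frame_id`, so replays and retries are no-ops. The dataset and table names here are hypothetical; the MERGE statement itself is standard BigQuery SQL:

```python
# Sketch: idempotent index writes via MERGE from a staging table.
from google.cloud import bigquery

MERGE_SQL = """
MERGE `lidar.index_records` T
USING `lidar.index_staging` S
ON T.frame_id = S.frame_id
WHEN NOT MATCHED THEN
  INSERT (frame_id, h3_cell, ts_bucket, uri, quality_flags)
  VALUES (S.frame_id, S.h3_cell, S.ts_bucket, S.uri, S.quality_flags)
"""

def flush_staging_to_index(client: bigquery.Client) -> int:
    """Apply staged rows at most once per frame_id; return rows merged."""
    job = client.query(MERGE_SQL)   # standard google-cloud-bigquery query job
    job.result()                    # block until the DML completes
    return job.num_dml_affected_rows or 0
```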
Constraints
- Cloud: GCP (GCS, Pub/Sub, Dataflow/Spark, BigQuery, Bigtable/Spanner are available). Assume existing GKE for services.
- Team: 6 data engineers, 2 SREs; strong Spark skills, moderate streaming experience.
- Budget: must keep incremental platform spend under $400k/month at stated scale.
- Vehicles upload over unreliable networks; cannot require always-on connectivity.
Your Task (what the candidate should design)
Propose an end-to-end architecture (stream + batch) that:
- lands raw data durably,
- produces canonical metadata tables,
- builds a geo-time index suitable for interactive search,
- supports backfills and schema evolution,
- includes monitoring, alerting, and failure recovery.
Be explicit about partitioning strategy, watermarking/late data handling, idempotency keys, and how you would validate completeness (e.g., “did we receive all segments for drive_id?”). Discuss trade-offs between BigQuery vs Bigtable/Spanner for the index, and how you would keep investigator queries fast without exploding costs.
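For example, if vehicles upload a per-drive manifest listing every recorded `segment_id` (an assumed design choice the candidate would need to justify), completeness validation reduces to a set comparison that can be re-evaluated as late segments arrive:

```python
# Sketch: manifest-based completeness check for "did we receive all segments?".
from dataclasses import dataclass

@dataclass
class CompletenessReport:
    drive_id: str
    expected: int
    received: int
    missing: frozenset          # segment_ids still outstanding

    @property
    def complete(self) -> bool:
        return not self.missing

def check_drive(drive_id: str, manifest: set, received: set) -> CompletenessReport:
    """Compare the uploaded manifest against segments actually landed/parsed."""
    missing = frozenset(manifest - received)
    return CompletenessReport(drive_id, len(manifest), len(received & manifest), missing)

report = check_drive("d-1", manifest={"s-1", "s-2", "s-3"}, received={"s-1", "s-3"})
assert not report.complete and report.missing == frozenset({"s-2"})
```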