Context
Waymo’s Mapping & Localization org ingests LiDAR point clouds from the entire autonomous vehicle fleet to power multiple downstream consumers: (1) offline HD map regeneration, (2) perception model training data selection, (3) incident investigation (“show me all drives near this intersection last Tuesday”), and (4) long-term fleet health analytics (sensor drift, calibration issues). Today, vehicles upload “drive bundles” to cloud storage when they return to depot Wi‑Fi. A nightly batch pipeline parses bundles and writes derived artifacts to a data lake. This architecture has two major problems: freshness (data does not become searchable until 12–36 hours after upload) and operational fragility (retries and partial uploads create duplicates, missing segments, and inconsistent indexes).
You are asked to design a new system to ingest and index LiDAR data from the entire Waymo fleet such that it is queryable quickly and reliably. The system must support both near-real-time indexing (for safety investigations) and cost-efficient batch backfills (for reprocessing after algorithm updates). The platform is Google Cloud–based and must meet strict privacy and security requirements because LiDAR-derived products can be linked to precise geolocation.
Scale Requirements
- Fleet size: 25,000 vehicles globally
- Sensors: 1 roof LiDAR per vehicle, ~20 Hz frames
- Raw LiDAR payload: ~3–6 MB/frame compressed (varies by environment)
- Peak ingestion: 3,000 vehicles uploading concurrently at depot shift changes
- Daily volume: ~2–5 PB/day raw across the fleet (including other sensors, but LiDAR is the dominant contributor)
- Indexing latency SLO: 95% of frames searchable within 15 minutes of cloud arrival; 99% within 60 minutes
- Retention:
  - Raw immutable LiDAR: 180 days (cold storage after 30 days)
  - Derived “index + metadata”: 2+ years
  - Aggregates (tile summaries, drive catalogs): indefinite
- Query patterns (see the sketch after this list):
  - “All frames within polygon + time range”
  - “All drives that traversed road segment X”
  - “Sample N frames with rain + night + speed > 30 mph”
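To make the first query pattern concrete, here is a minimal sketch of how “polygon + time range” could be translated into bounded scans over an index keyed by spatial cell and time bucket. The `<h3_cell>#<ts_bucket>` key layout and hourly bucket width are assumptions, not a prescribed design; in practice a library such as h3-py would compute the covering cells for the polygon.

```python
# Sketch: translate "polygon + time range" into bounded index scan ranges.
# Assumes (invented here) index rows keyed as "<h3_cell>#<ts_bucket>".
from datetime import datetime, timedelta, timezone

def ts_buckets(start: datetime, end: datetime, width=timedelta(hours=1)):
    """Yield hourly bucket labels (e.g. '2024061810') covering [start, end)."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        yield t.strftime("%Y%m%d%H")
        t += width

def scan_prefixes(covering_cells: set[str], start: datetime, end: datetime):
    """Cross product of covering cells and time buckets -> row-key prefixes."""
    return sorted(f"{cell}#{bucket}"
                  for cell in covering_cells
                  for bucket in ts_buckets(start, end))

# Example: 2 covering cells x a 3-hour window -> 6 bounded prefix scans,
# instead of a full-table scan over petabytes of frame metadata.
cells = {"8928308280fffff", "8928308280bffff"}  # example resolution-9 H3 cells
start = datetime(2024, 6, 18, 10, tzinfo=timezone.utc)
print(scan_prefixes(cells, start, start + timedelta(hours=3)))
```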
Data Characteristics
LiDAR arrives as drive sessions containing many segments (e.g., 1–5 minute chunks). Uploads are often partial (a vehicle loses connectivity mid-transfer), can arrive out of order, and may be re-uploaded. Some frames are corrupted or missing timestamps.
Canonical metadata schema (logical)
| Entity | Key fields | Notes |
|---|---|---|
| Drive | drive_id, vehicle_id, start_ts, end_ts, region | drive_id is globally unique (UUID) |
| Segment | segment_id, drive_id, segment_start_ts, segment_end_ts, uri | segment_id unique within drive |
| Frame | frame_id, segment_id, frame_ts, pose, bytes, checksum | frame_ts used for time indexing |
| Index record | frame_id, geohash/H3, ts_bucket, uri, quality_flags | used for search + filtering |
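For concreteness, the logical schema above could be written down as Python dataclasses. This is a sketch: the types, timestamp units (epoch microseconds), and pose layout are assumptions, not part of the canonical spec.

```python
# Sketch of the logical schema; field names follow the table above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Drive:
    drive_id: str          # globally unique (UUID)
    vehicle_id: str
    start_ts: int          # epoch micros (assumed unit)
    end_ts: int
    region: str

@dataclass(frozen=True)
class Segment:
    segment_id: str        # unique within a drive
    drive_id: str
    segment_start_ts: int
    segment_end_ts: int
    uri: str               # location of the raw blob in GCS

@dataclass(frozen=True)
class Frame:
    frame_id: str
    segment_id: str
    frame_ts: int          # used for time indexing
    pose: tuple            # e.g. (x, y, z, qw, qx, qy, qz) -- assumed layout
    bytes: int             # payload size
    checksum: str          # e.g. CRC32C of the payload

@dataclass(frozen=True)
class IndexRecord:
    frame_id: str
    h3_cell: str           # geohash/H3 spatial key
    ts_bucket: str         # coarse time bucket for pruning
    uri: str
    quality_flags: int = 0
```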
Quality issues to handle: duplicates (same frame uploaded twice), late-arriving segments (hours/days), schema evolution (new fields), and “poison” files that crash parsers.
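As one illustration of the last two issues, a defensive parse wrapper can quarantine poison files and drop timestamp-less frames instead of failing an entire batch. The decoder and dead-letter sink below are hypothetical callables, not an existing API:

```python
# Sketch: never let a single "poison" file crash a worker or fail a batch.
import logging
from typing import Callable, Iterable

log = logging.getLogger("ingest")

def parse_with_quarantine(
    blob_uri: str,
    raw: bytes,
    decode: Callable[[bytes], Iterable],   # your segment decoder (hypothetical)
    dead_letter: Callable[[str], None],    # e.g. copy blob to a quarantine bucket
) -> list:
    """Decode one segment blob; quarantine on failure, drop bad frames."""
    try:
        frames = list(decode(raw))
    except Exception:
        log.exception("poison file, quarantining: %s", blob_uri)
        dead_letter(blob_uri)              # keep the evidence for offline triage
        return []
    # Drop frames with missing timestamps rather than failing the whole segment.
    good = [f for f in frames if getattr(f, "frame_ts", None) is not None]
    dropped = len(frames) - len(good)
    if dropped:
        log.warning("%s: dropped %d frame(s) without timestamps", blob_uri, dropped)
    return good
```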
Requirements
Functional
- Ingest LiDAR drive bundles from vehicles into cloud storage with integrity checks (checksums) and provenance (vehicle firmware version, sensor calibration version).
- Parse and normalize raw bundles into a canonical frame/segment representation (columnar metadata + binary blobs).
- Index frames for fast retrieval by (geo, time) and common filters (weather, road type, speed, sensor health flags).
- Support late-arriving data and re-uploads without double-counting (idempotent processing; see the key-derivation sketch after this list).
- Provide a reprocessing/backfill mechanism when parsing/indexing logic changes (e.g., new pose correction), without breaking existing consumers.
- Expose a serving interface for investigators and ML data selection jobs to find and fetch frames quickly.
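As an illustration of the idempotency requirement above, one possible key derivation (an assumption, not the required design) hashes only the content-identifying fields, so a byte-identical re-upload maps to the same key and deduplicates cleanly:

```python
# Sketch: stable idempotency key for frame-level dedupe.
import hashlib

def frame_idempotency_key(drive_id: str, segment_id: str,
                          frame_ts: int, checksum: str) -> str:
    """Identical re-uploads collide on the same key; changed payloads do not."""
    material = f"{drive_id}|{segment_id}|{frame_ts}|{checksum}".encode()
    return hashlib.sha256(material).hexdigest()

# Re-uploading the same frame yields the same key ...
k1 = frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:9a8b")
k2 = frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:9a8b")
assert k1 == k2
# ... while a corrupted/changed payload (different checksum) yields a new key.
assert k1 != frame_idempotency_key("d-1", "s-7", 1718700000123456, "crc32c:0000")
```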
Non-functional
- Exactly-once effects for index writes (or clearly defined at-least-once delivery with dedupe guarantees; see the upsert sketch after this list).
- Security & compliance: encryption at rest/in transit, IAM least privilege, audit logs; support deletion of specific drives if required by policy.
- Cost controls: separate hot vs cold tiers; avoid scanning PBs for common queries.
- Observability: end-to-end latency, completeness, and data quality metrics with actionable alerts.
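One common way to realize “at-least-once with dedupe guarantees” on GCP is to stage index rows and apply them with a BigQuery MERGE keyed on `frame_id`, so replays and retries are no-ops. The dataset and table names here are hypothetical; the MERGE statement itself is standard BigQuery SQL:

```python
# Sketch: idempotent index writes via MERGE from a staging table.
from google.cloud import bigquery

MERGE_SQL = """
MERGE `lidar.index_records` T
USING `lidar.index_staging` S
ON T.frame_id = S.frame_id
WHEN NOT MATCHED THEN
  INSERT (frame_id, h3_cell, ts_bucket, uri, quality_flags)
  VALUES (S.frame_id, S.h3_cell, S.ts_bucket, S.uri, S.quality_flags)
"""

def flush_staging_to_index(client: bigquery.Client) -> int:
    """Apply staged rows at most once per frame_id; return rows merged."""
    job = client.query(MERGE_SQL)   # standard google-cloud-bigquery query job
    job.result()                    # block until the DML completes
    return job.num_dml_affected_rows or 0
```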
Constraints
- Cloud: GCP (GCS, Pub/Sub, Dataflow/Spark, BigQuery, Bigtable/Spanner are available). Assume existing GKE for services.
- Team: 6 data engineers, 2 SREs; strong Spark skills, moderate streaming experience.
- Budget: must keep incremental platform spend under $400k/month at stated scale.
- Vehicles upload over unreliable networks; cannot require always-on connectivity.
Your Task (what the candidate should design)
Propose an end-to-end architecture (stream + batch) that:
- lands raw data durably,
- produces canonical metadata tables,
- builds a geo-time index suitable for interactive search,
- supports backfills and schema evolution,
- includes monitoring, alerting, and failure recovery.
Be explicit about partitioning strategy, watermarking/late data handling, idempotency keys, and how you would validate completeness (e.g., “did we receive all segments for drive_id?”). Discuss trade-offs between BigQuery vs Bigtable/Spanner for the index, and how you would keep investigator queries fast without exploding costs.
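For example, if vehicles upload a per-drive manifest listing every recorded `segment_id` (an assumed design choice the candidate would need to justify), completeness validation reduces to a set comparison that can be re-evaluated as late segments arrive:

```python
# Sketch: manifest-based completeness check for "did we receive all segments?".
from dataclasses import dataclass

@dataclass
class CompletenessReport:
    drive_id: str
    expected: int
    received: int
    missing: frozenset          # segment_ids still outstanding

    @property
    def complete(self) -> bool:
        return not self.missing

def check_drive(drive_id: str, manifest: set, received: set) -> CompletenessReport:
    """Compare the uploaded manifest against segments actually landed/parsed."""
    missing = frozenset(manifest - received)
    return CompletenessReport(drive_id, len(manifest), len(received & manifest), missing)

report = check_drive("d-1", manifest={"s-1", "s-2", "s-3"}, received={"s-1", "s-3"})
assert not report.complete and report.missing == frozenset({"s-2"})
```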