Context
WaymoScale (autonomous vehicle + mapping) runs a fleet of 18,000 test vehicles across California, Texas, and Arizona. Each vehicle records multi-modal “driving logs” used by the Quantitative Evaluation (QE) team to score new perception/planning models before they’re deployed to the fleet. A single drive produces camera frames, LiDAR point clouds, radar sweeps, CAN bus signals, GPS/IMU, and model inference outputs. QE computes metrics like collision-risk proxies, lane-keeping violations, disengagement root causes, and scenario coverage.
Today, logs are uploaded opportunistically over cellular/Wi‑Fi to a cloud object store. A set of daily Spark jobs aggregates the data into a Hive metastore and exports summary CSVs to analysts. This architecture is failing: (1) freshness is 24–48 hours, so safety regressions are detected late; (2) late-arriving uploads (vehicles offline for days) cause metric backfills that are manual and error-prone; (3) schema drift across vehicle software versions breaks jobs; and (4) compute costs spike because backfills reprocess entire days of data.
You are the senior data engineer tasked with designing a production-grade pipeline that supports both near-real-time monitoring and reproducible batch evaluation for model releases.
Scale Requirements
- Fleet event rate: avg 2.5 GB/min/vehicle of raw sensor payload (compressed), with peaks on dense urban routes
- Ingest volume: 8–12 PB/day raw across all modalities; ~1.5 PB/day after selecting QE-relevant signals
- File sizes: 64 MB–2 GB objects; a mix of small metadata events and large binary chunks
- Latency targets:
  - Safety monitoring aggregates: P95 < 15 minutes from upload to queryable metrics
  - Full QE batch for a model candidate: < 6 hours for a 24-hour slice
- Retention:
  - Raw immutable logs: 18 months
  - Curated “QE-ready” tables: 24 months
  - Aggregates: indefinite
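The ingest and retention numbers above imply a very large steady-state footprint, which is worth making explicit before designing storage tiering. A back-of-envelope sketch (midpoint daily volumes, 30-day months, no compaction or tiering assumed):

```python
# Back-of-envelope storage footprint implied by the ingest and retention
# targets above (steady state; constants are the document's own figures).

RAW_PB_PER_DAY = 10.0        # midpoint of the 8-12 PB/day raw ingest
CURATED_PB_PER_DAY = 1.5     # QE-relevant subset after signal selection

raw_retained_pb = RAW_PB_PER_DAY * 30 * 18          # 18 months of raw logs
curated_retained_pb = CURATED_PB_PER_DAY * 30 * 24  # 24 months of curated tables

print(f"raw steady-state:     {raw_retained_pb:,.0f} PB")
print(f"curated steady-state: {curated_retained_pb:,.0f} PB")
```

A footprint in the exabyte range for Bronze makes aggressive lifecycle tiering (e.g. archival storage classes for raw logs past the hot window) a first-order design concern, not an optimization.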
Data Characteristics
Modalities and schemas
| Modality | Example fields | Notes |
|---|---|---|
| Drive metadata | drive_id, vehicle_id, sw_version, route_id, start_ts, end_ts | Small, high-value for joins |
| Events stream | ts, event_type, severity, subsystem, payload_json | Many event types; schema evolves |
| Sensor chunks | camera_uri, lidar_uri, radar_uri, chunk_start_ts, chunk_end_ts | Large binaries; referenced by URI |
| Model outputs | ts, model_version, object_tracks, planner_decisions | Needed for offline scoring |
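The events-stream schema evolves weekly, so it is the natural first target for a data contract. A minimal sketch of a versioned contract checked at ingest, using the field names from the table above (the contract format and validation logic are illustrative assumptions, not an existing standard):

```python
# Minimal data-contract sketch for the events stream. Required field names
# come from the modality table; the contract structure itself is hypothetical.

EVENTS_CONTRACT_V1 = {
    "name": "events_stream",
    "version": 1,
    "required": {"ts": int, "event_type": str, "severity": str,
                 "subsystem": str, "payload_json": str},
    "optional": {},  # new fields land here first (additive evolution)
}

def validate_event(event: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    errors = []
    for name, typ in contract["required"].items():
        if name not in event:
            errors.append(f"missing required field: {name}")
        elif not isinstance(event[name], typ):
            errors.append(f"bad type for {name}: {type(event[name]).__name__}")
    return errors
```

Events that fail validation would be routed to a quarantine table rather than dropped, so that contract bugs are recoverable.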
Quality issues you must handle
- Late arrivals: uploads can arrive up to 7 days late; partial uploads are common.
- Duplicates: retry logic can upload the same chunk multiple times.
- Out-of-order timestamps: device clock skew; some sensors drift.
- Schema evolution: new event types weekly; fields added/renamed across software versions.
- Corruption: truncated binary chunks; malformed JSON payloads.
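The duplicate and partial-upload issues above reduce to a deterministic "keep the best record per natural key" rule in the Silver layer. A sketch of that rule, assuming each chunk carries a (drive_id, chunk_id) key plus upload_ts and is_complete flags (field names are illustrative):

```python
# Dedup sketch for retried chunk uploads: one record per natural key,
# preferring complete uploads, then the latest upload_ts. The key and
# field names (chunk_id, upload_ts, is_complete) are assumptions.

def dedupe_chunks(records: list[dict]) -> list[dict]:
    best: dict[tuple, dict] = {}
    for r in records:
        key = (r["drive_id"], r["chunk_id"])
        cur = best.get(key)
        # Rank tuples compare elementwise: completeness first, then recency.
        rank = (r.get("is_complete", False), r["upload_ts"])
        if cur is None or rank > (cur.get("is_complete", False), cur["upload_ts"]):
            best[key] = r
    return sorted(best.values(), key=lambda r: r["upload_ts"])
```

Because the rule is a pure function of the input set, re-running it over a partition after a late arrival yields the same result plus the new record, which is exactly the idempotency the backfill path needs.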
Requirements
Functional
- Ingest driving logs from vehicles via an upload service into a durable landing zone.
- Support two paths:
  - Streaming/near-real-time aggregates for safety monitoring.
  - Batch reproducible QE computations for model release gates.
- Build a curated lakehouse layout:
  - Bronze (raw immutable), Silver (validated + deduped + normalized), Gold (metrics + scenario tables).
- Implement idempotent processing and deterministic re-runs for a given (drive_id, model_version).
- Handle late-arriving data with automated backfill and incremental recomputation of affected aggregates.
- Provide data contracts and schema evolution strategy across producers.
- Enable analysts and ML engineers to query metrics in a warehouse with strong performance for time-window and scenario filters.
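The idempotent-processing requirement above can be made concrete with a deterministic run identity: the same (drive_id, model_version) pair, processed by the same code revision, always writes to the same output location. A minimal sketch (the hash scheme and S3 path layout are illustrative assumptions):

```python
# Deterministic run identity sketch for reproducible QE re-runs. Re-running
# a computation overwrites exactly the same partition instead of appending.
import hashlib

def qe_run_key(drive_id: str, model_version: str, code_rev: str) -> str:
    """Stable key: identical inputs always map to the same output location."""
    raw = f"{drive_id}|{model_version}|{code_rev}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def gold_partition_path(drive_id: str, model_version: str, code_rev: str) -> str:
    # Hypothetical bucket/layout; shown only to illustrate the keying scheme.
    return (f"s3://qe-gold/metrics/model_version={model_version}"
            f"/drive_id={drive_id}/run={qe_run_key(drive_id, model_version, code_rev)}")
```

Including the code revision in the key means a metric-logic fix produces a new, distinguishable run rather than silently mutating a release gate's inputs.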
Non-functional
- Reliability: 99.9% pipeline availability; no silent data loss.
- Cost: keep incremental cloud spend under $250K/month (compute + storage + warehouse).
- Security/compliance: driving logs may contain PII (faces/plates). Enforce encryption, access controls, and audit logs. Support deletion requests within 30 days.
- Observability: end-to-end lineage, freshness SLAs, and data quality reporting.
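The freshness SLA above (P95 < 15 minutes, upload to queryable) is checkable with a simple percentile computation over observed lags. A sketch, assuming per-record lag samples are collected by the observability layer (the nearest-rank percentile method is one reasonable choice):

```python
# Freshness-SLA check sketch for the safety-monitoring path. Lag samples
# (seconds from upload to queryable) are assumed to come from pipeline
# instrumentation; the percentile method is nearest-rank.

def p95_lag_minutes(lags_seconds: list[float]) -> float:
    if not lags_seconds:
        return 0.0
    ordered = sorted(lags_seconds)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx] / 60.0

def freshness_ok(lags_seconds: list[float], sla_minutes: float = 15.0) -> bool:
    return p95_lag_minutes(lags_seconds) < sla_minutes
```

An alerting rule would page when `freshness_ok` fails for several consecutive evaluation windows, to avoid flapping on a single slow upload batch.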
Constraints
- Cloud: AWS (S3, IAM, KMS) is mandated.
- Existing skills: team is strong in Spark + Airflow + dbt, moderate in Kafka.
- Warehouse: company standard is Snowflake.
- Upload service already emits metadata events; you may extend it but cannot replace it.
What you should produce in your answer
- A target architecture (components + responsibilities) and how data flows end-to-end.
- Partitioning and data modeling choices for Bronze/Silver/Gold.
- Strategy for late data, deduplication, and incremental recomputation.
- Orchestration plan (Airflow DAGs, backfill strategy, dependency management).
- Data quality framework (checks, quarantine, SLAs) and schema evolution approach.
- Monitoring/alerting and failure recovery playbooks.
- Performance and cost optimization techniques for Spark and Snowflake.
Your design should be detailed enough that another engineer could start implementing it, including at least one streaming job and one dbt model pattern.
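As a starting point for the schema-evolution deliverable, the contract strategy can be enforced mechanically with a compatibility check between schema versions. A sketch, assuming an additive-only policy (new optional fields allowed; removing or retyping existing fields forbidden) and schemas represented as plain field-to-type-name dicts:

```python
# Backward-compatibility check sketch for event schemas, assuming an
# additive-only evolution policy. The dict-of-type-names representation
# is illustrative; a real system might use Avro or JSON Schema.

def is_backward_compatible(old: dict[str, str],
                           new: dict[str, str]) -> tuple[bool, list[str]]:
    """Every field of the old schema must survive, with the same type."""
    problems = []
    for name, typ in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name] != typ:
            problems.append(f"type changed: {name} {typ} -> {new[name]}")
    return (not problems, problems)
```

Run as a CI gate on producer repos, a check like this turns weekly schema drift from a pipeline-breaking surprise into a reviewable diff.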