Context
WaymoScale (autonomous vehicle + mapping) runs a fleet of 18,000 test vehicles across California, Texas, and Arizona. Each vehicle records multi-modal “driving logs” used by the Quantitative Evaluation (QE) team to score new perception/planning models before they’re deployed to the fleet. A single drive produces camera frames, LiDAR point clouds, radar sweeps, CAN bus signals, GPS/IMU, and model inference outputs. QE computes metrics like collision-risk proxies, lane-keeping violations, disengagement root causes, and scenario coverage.
Today, logs are uploaded opportunistically over cellular/Wi‑Fi to a cloud object store. A set of daily Spark jobs aggregates the data into a Hive metastore and exports summary CSVs to analysts. This architecture is failing: (1) freshness is 24–48 hours, so safety regressions are detected late; (2) late-arriving uploads (vehicles offline for days) cause metric backfills that are manual and error-prone; (3) schema drift across vehicle software versions breaks jobs; and (4) compute costs spike because backfills reprocess entire days of data.
You are the senior data engineer tasked with designing a production-grade pipeline that supports both near-real-time monitoring and reproducible batch evaluation for model releases.
Scale Requirements
- Fleet event rate: avg 2.5 GB/min/vehicle of raw sensor payload (compressed), with peaks on dense urban routes
- Ingest volume: 8–12 PB/day raw across all modalities; ~1.5 PB/day after selecting QE-relevant signals
- File sizes: 64 MB–2 GB objects; a mix of small metadata events and large binary chunks
- Latency targets:
  - Safety monitoring aggregates: P95 < 15 minutes from upload to queryable metrics
  - Full QE batch for a model candidate: < 6 hours for a 24-hour slice
- Retention:
  - Raw immutable logs: 18 months
  - Curated “QE-ready” tables: 24 months
  - Aggregates: indefinite
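The ingest and retention numbers above imply a very large steady-state footprint, which is worth making explicit before designing storage tiering. A back-of-envelope sketch (midpoint daily volumes, 30-day months, no compaction or tiering assumed):

```python
# Back-of-envelope storage footprint implied by the ingest and retention
# targets above (steady state; constants are the document's own figures).

RAW_PB_PER_DAY = 10.0        # midpoint of the 8-12 PB/day raw ingest
CURATED_PB_PER_DAY = 1.5     # QE-relevant subset after signal selection

raw_retained_pb = RAW_PB_PER_DAY * 30 * 18          # 18 months of raw logs
curated_retained_pb = CURATED_PB_PER_DAY * 30 * 24  # 24 months of curated tables

print(f"raw steady-state:     {raw_retained_pb:,.0f} PB")
print(f"curated steady-state: {curated_retained_pb:,.0f} PB")
```

A footprint in the exabyte range for Bronze makes aggressive lifecycle tiering (e.g. archival storage classes for raw logs past the hot window) a first-order design concern, not an optimization.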
Data Characteristics
Modalities and schemas
| Modality | Example fields | Notes |
|---|---|---|
| Drive metadata | drive_id, vehicle_id, sw_version, route_id, start_ts, end_ts | Small, high-value for joins |
| Events stream | ts, event_type, severity, subsystem, payload_json | Many event types; schema evolves |
| Sensor chunks | camera_uri, lidar_uri, radar_uri, chunk_start_ts, chunk_end_ts | Large binaries; referenced by URI |
| Model outputs | ts, model_version, object_tracks, planner_decisions | Needed for offline scoring |
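The events-stream schema evolves weekly, so it is the natural first target for a data contract. A minimal sketch of a versioned contract checked at ingest, using the field names from the table above (the contract format and validation logic are illustrative assumptions, not an existing standard):

```python
# Minimal data-contract sketch for the events stream. Required field names
# come from the modality table; the contract structure itself is hypothetical.

EVENTS_CONTRACT_V1 = {
    "name": "events_stream",
    "version": 1,
    "required": {"ts": int, "event_type": str, "severity": str,
                 "subsystem": str, "payload_json": str},
    "optional": {},  # new fields land here first (additive evolution)
}

def validate_event(event: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    errors = []
    for name, typ in contract["required"].items():
        if name not in event:
            errors.append(f"missing required field: {name}")
        elif not isinstance(event[name], typ):
            errors.append(f"bad type for {name}: {type(event[name]).__name__}")
    return errors
```

Events that fail validation would be routed to a quarantine table rather than dropped, so that contract bugs are recoverable.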
Quality issues you must handle
- Late arrivals: uploads can arrive up to 7 days late; partial uploads are common.
- Duplicates: retry logic can upload the same chunk multiple times.
- Out-of-order timestamps: device clock skew; some sensors drift.
- Schema evolution: new event types weekly; fields added/renamed across software versions.
- Corruption: truncated binary chunks; malformed JSON payloads.
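The duplicate and partial-upload issues above reduce to a deterministic "keep the best record per natural key" rule in the Silver layer. A sketch of that rule, assuming each chunk carries a (drive_id, chunk_id) key plus upload_ts and is_complete flags (field names are illustrative):

```python
# Dedup sketch for retried chunk uploads: one record per natural key,
# preferring complete uploads, then the latest upload_ts. The key and
# field names (chunk_id, upload_ts, is_complete) are assumptions.

def dedupe_chunks(records: list[dict]) -> list[dict]:
    best: dict[tuple, dict] = {}
    for r in records:
        key = (r["drive_id"], r["chunk_id"])
        cur = best.get(key)
        # Rank tuples compare elementwise: completeness first, then recency.
        rank = (r.get("is_complete", False), r["upload_ts"])
        if cur is None or rank > (cur.get("is_complete", False), cur["upload_ts"]):
            best[key] = r
    return sorted(best.values(), key=lambda r: r["upload_ts"])
```

Because the rule is a pure function of the input set, re-running it over a partition after a late arrival yields the same result plus the new record, which is exactly the idempotency the backfill path needs.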
Requirements
Functional
- Ingest driving logs from vehicles via an upload service into a durable landing zone.
- Support two paths:
  - Streaming/near-real-time aggregates for safety monitoring.
  - Batch reproducible QE computations for model release gates.
- Build a curated lakehouse layout:
  - Bronze (raw immutable), Silver (validated + deduped + normalized), Gold (metrics + scenario tables).
- Implement idempotent processing and deterministic re-runs for a given (drive_id, model_version).
- Handle late-arriving data with automated backfill and incremental recomputation of affected aggregates.
- Provide data contracts and schema evolution strategy across producers.
- Enable analysts and ML engineers to query metrics in a warehouse with strong performance for time-window and scenario filters.
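The idempotent-processing requirement above can be made concrete with a deterministic run identity: the same (drive_id, model_version) pair, processed by the same code revision, always writes to the same output location. A minimal sketch (the hash scheme and S3 path layout are illustrative assumptions):

```python
# Deterministic run identity sketch for reproducible QE re-runs. Re-running
# a computation overwrites exactly the same partition instead of appending.
import hashlib

def qe_run_key(drive_id: str, model_version: str, code_rev: str) -> str:
    """Stable key: identical inputs always map to the same output location."""
    raw = f"{drive_id}|{model_version}|{code_rev}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def gold_partition_path(drive_id: str, model_version: str, code_rev: str) -> str:
    # Hypothetical bucket/layout; shown only to illustrate the keying scheme.
    return (f"s3://qe-gold/metrics/model_version={model_version}"
            f"/drive_id={drive_id}/run={qe_run_key(drive_id, model_version, code_rev)}")
```

Including the code revision in the key means a metric-logic fix produces a new, distinguishable run rather than silently mutating a release gate's inputs.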
Non-functional
- Reliability: 99.9% pipeline availability; no silent data loss.
- Cost: keep incremental cloud spend under $250K/month (compute + storage + warehouse).
- Security/compliance: driving logs may contain PII (faces/plates). Enforce encryption, access controls, and audit logs. Support deletion requests within 30 days.
- Observability: end-to-end lineage, freshness SLAs, and data quality reporting.
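The freshness SLA above (P95 < 15 minutes, upload to queryable) is checkable with a simple percentile computation over observed lags. A sketch, assuming per-record lag samples are collected by the observability layer (the nearest-rank percentile method is one reasonable choice):

```python
# Freshness-SLA check sketch for the safety-monitoring path. Lag samples
# (seconds from upload to queryable) are assumed to come from pipeline
# instrumentation; the percentile method is nearest-rank.

def p95_lag_minutes(lags_seconds: list[float]) -> float:
    if not lags_seconds:
        return 0.0
    ordered = sorted(lags_seconds)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx] / 60.0

def freshness_ok(lags_seconds: list[float], sla_minutes: float = 15.0) -> bool:
    return p95_lag_minutes(lags_seconds) < sla_minutes
```

An alerting rule would page when `freshness_ok` fails for several consecutive evaluation windows, to avoid flapping on a single slow upload batch.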
Constraints
- Cloud: AWS (S3, IAM, KMS) is mandated.
- Existing skills: team is strong in Spark + Airflow + dbt, moderate in Kafka.
- Warehouse: company standard is Snowflake.
- Upload service already emits metadata events; you may extend it but cannot replace it.
What you should produce in your answer
- A target architecture (components + responsibilities) and how data flows end-to-end.
- Partitioning and data modeling choices for Bronze/Silver/Gold.
- Strategy for late data, deduplication, and incremental recomputation.
- Orchestration plan (Airflow DAGs, backfill strategy, dependency management).
- Data quality framework (checks, quarantine, SLAs) and schema evolution approach.
- Monitoring/alerting and failure recovery playbooks.
- Performance and cost optimization techniques for Spark and Snowflake.
Your design should be detailed enough that another engineer could start implementing it, including at least one streaming job and one dbt model pattern.
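As a starting point for the schema-evolution deliverable, the contract strategy can be enforced mechanically with a compatibility check between schema versions. A sketch, assuming an additive-only policy (new optional fields allowed; removing or retyping existing fields forbidden) and schemas represented as plain field-to-type-name dicts:

```python
# Backward-compatibility check sketch for event schemas, assuming an
# additive-only evolution policy. The dict-of-type-names representation
# is illustrative; a real system might use Avro or JSON Schema.

def is_backward_compatible(old: dict[str, str],
                           new: dict[str, str]) -> tuple[bool, list[str]]:
    """Every field of the old schema must survive, with the same type."""
    problems = []
    for name, typ in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name] != typ:
            problems.append(f"type changed: {name} {typ} -> {new[name]}")
    return (not problems, problems)
```

Run as a CI gate on producer repos, a check like this turns weekly schema drift from a pipeline-breaking surprise into a reviewable diff.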