Context
You’re interviewing with the Autonomy Platform team at a Waymo-like autonomous driving company that runs large-scale simulation to validate perception, prediction, and planning models before on-road testing. The simulation fleet replays real-world traffic scenarios (e.g., unprotected left turns, emergency vehicle interactions, pedestrian jaywalking) to reproduce bugs and measure regression risk.
Today, the company logs vehicle telemetry and perception outputs from 15,000 test vehicles across Phoenix, SF, and Austin. Data lands in a data lake (GCS) and is later curated into “scenarios” by offline analysts. The current workflow is slow and non-deterministic: scenario extraction is batch-only (24–48 hour lag), and replay results can differ run-to-run due to inconsistent event ordering, schema drift, and missing late-arriving sensor packets. This has caused safety-critical regressions to slip into weekly model releases and has created compliance risk because scenario provenance is not consistently auditable.
The VP of Engineering wants a simulation replay engine that allows developers to specify a scenario (by time range, geo-fence, vehicle IDs, or a known incident ID), and then replay it deterministically into a simulator cluster. The system must support both real-time “nearline” scenario availability for rapid debugging and large-scale backfills for offline evaluation.
Scale Requirements
- Ingestion throughput: peak 2,000,000 events/sec across all vehicles (bursty during dense urban driving)
- Event size: avg 1.5 KB (protobuf), with occasional 50–200 KB blobs (compressed point clouds referenced by URI)
- Daily volume: ~20 TB/day raw telemetry + 200 TB/day referenced sensor blobs (stored separately)
- Scenario query latency: developers can locate and start replay of an indexed scenario in < 30 seconds
- Replay control latency: pause/seek/speed changes applied in < 2 seconds
- Determinism: identical inputs must yield byte-identical ordered event streams for a given scenario version
- Retention: raw events 180 days, curated scenario manifests 2 years
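A rough back-of-envelope sketch in Python makes these numbers concrete; the per-partition throughput figure is an assumption for illustration, not a measured value.

```python
# Back-of-envelope capacity math from the figures above.
PEAK_EVENTS_PER_SEC = 2_000_000
AVG_EVENT_BYTES = 1_500            # ~1.5 KB protobuf payload
RAW_TB_PER_DAY = 20
RAW_RETENTION_DAYS = 180

peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES           # ~3 GB/s at peak
assumed_partition_mb_per_sec = 10                                     # assumption: conservative per-partition write rate
min_partitions = peak_bytes_per_sec / (assumed_partition_mb_per_sec * 1_000_000)

raw_retention_tb = RAW_TB_PER_DAY * RAW_RETENTION_DAYS                # ~3,600 TB of raw telemetry at 180 days

print(f"Peak ingest: {peak_bytes_per_sec / 1e9:.1f} GB/s")
print(f"Minimum partitions at {assumed_partition_mb_per_sec} MB/s each: {min_partitions:.0f}")
print(f"Raw telemetry footprint at full retention: {raw_retention_tb:,.0f} TB")
```

The ~3,600 TB figure at full 180-day retention hints that most of that retention will live in object storage rather than in the streaming log itself.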
Data Characteristics
Event types (examples)
| Stream | Key fields | Notes |
|---|---|---|
| vehicle_pose | vehicle_id, event_time, seq_no, x, y, z, heading | High frequency (50–100 Hz) |
| perception_objects | vehicle_id, event_time, frame_id, objects[] | Medium frequency (10–20 Hz), schema evolves |
| control_commands | vehicle_id, event_time, seq_no, steer, brake, throttle | Must be strictly ordered |
| incident_markers | incident_id, vehicle_id, event_time, type, metadata | Low frequency, used for indexing |
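To make the stream shapes concrete, here is a minimal sketch of these records as Python dataclasses; only the field names in the table come from the prompt, and all types and units are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class VehiclePose:
    vehicle_id: str
    event_time: float          # assumed: epoch seconds; actual unit depends on the protobuf schema
    seq_no: int
    x: float
    y: float
    z: float
    heading: float

@dataclass
class PerceptionObjects:
    vehicle_id: str
    event_time: float
    frame_id: int
    objects: List[Dict[str, Any]]   # schema evolves monthly; keep permissive here

@dataclass
class ControlCommand:
    vehicle_id: str
    event_time: float
    seq_no: int                     # strict per-vehicle ordering key
    steer: float
    brake: float
    throttle: float

@dataclass
class IncidentMarker:
    incident_id: str
    vehicle_id: str
    event_time: float
    type: str
    metadata: Dict[str, Any] = field(default_factory=dict)
```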
Quality issues you must handle
- Late-arriving packets: up to 15 minutes late due to connectivity gaps
- Duplicates: retries from edge gateways cause duplicate seq numbers
- Out-of-order events: cross-topic ordering is not guaranteed
- Schema evolution: perception payload adds/removes fields monthly
- Partial data: some sensors missing for a window; replay must either fail fast or degrade explicitly
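As an illustration of how the lateness and duplicate issues above might be absorbed in a Spark Structured Streaming job (leaning on the team's Spark expertise), here is a minimal sketch. It assumes Spark 3.5+ for dropDuplicatesWithinWatermark; the topic name, JSON parsing (real payloads are protobuf), and GCS paths are placeholders.

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("telemetry-normalize").getOrCreate()

# Placeholder schema; real payloads are protobuf and would be decoded with a descriptor set.
pose_schema = T.StructType([
    T.StructField("vehicle_id", T.StringType()),
    T.StructField("event_time", T.TimestampType()),
    T.StructField("seq_no", T.LongType()),
    T.StructField("x", T.DoubleType()),
    T.StructField("y", T.DoubleType()),
    T.StructField("z", T.DoubleType()),
    T.StructField("heading", T.DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "BROKERS")       # placeholder
    .option("subscribe", "telemetry.vehicle_pose")      # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), pose_schema).alias("e"),
               F.col("partition"), F.col("offset"))
    .select("e.*", "partition", "offset")
    # Bound dedup state by the stated 15-minute lateness; events later than this may not be
    # deduplicated on this path and must be reconciled by the batch/backfill layer.
    .withWatermark("event_time", "15 minutes")
    # Drop gateway retries that reuse the same per-vehicle sequence number.
    .dropDuplicatesWithinWatermark(["vehicle_id", "seq_no"])
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "gs://bucket/normalized/vehicle_pose/")         # placeholder
    .option("checkpointLocation", "gs://bucket/checkpoints/pose/")  # placeholder
    .trigger(processingTime="1 minute")
    .start()
)
```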
Your Task
Design the end-to-end data pipeline and orchestration to support scenario extraction, indexing, and deterministic replay.
Functional requirements
- Ingest telemetry streams from vehicles into a durable log with replay capability.
- Normalize and validate events (schema validation, required fields, unit normalization).
- Deduplicate and order events deterministically per scenario (define ordering rules).
- Build scenario manifests: a manifest describes scenario boundaries, required streams, blob URIs, and a deterministic ordering/offset mapping.
- Support developer APIs:
  - Find scenarios by incident_id, time range, geo-fence, tags (e.g., “unprotected_left”), and vehicle IDs.
  - Start and control replay: normal speed, 2x/0.5x, pause/resume, and seek to a timestamp.
- Backfill: regenerate manifests for historical data and re-run scenario extraction logic when rules change.
- Auditability: every replay run must be traceable to exact input offsets, schema versions, and extraction code version.
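As one possible (not prescribed) shape for the manifest and auditability requirements above, a scenario manifest could be a small versioned record like the following; all field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class StreamSlice:
    """Exact input range for one topic-partition feeding a scenario (audit trail)."""
    topic: str
    partition: int
    start_offset: int   # inclusive
    end_offset: int     # exclusive

@dataclass
class ScenarioManifest:
    scenario_id: str
    manifest_version: int                    # bumped when inputs or extraction rules change
    incident_id: Optional[str]
    vehicle_ids: List[str]
    start_time_ns: int
    end_time_ns: int
    required_streams: List[str]              # e.g. ["vehicle_pose", "control_commands"]
    blob_uris: List[str]                     # GCS URIs of referenced point clouds
    input_slices: List[StreamSlice]          # exact offsets consumed, per topic-partition
    schema_versions: Dict[str, str]          # stream -> schema version used to decode
    extraction_code_version: str             # e.g. git SHA of the extraction job
    # One possible deterministic ordering rule: sort all events by this tuple of fields.
    ordering_key: Tuple[str, ...] = ("event_time", "stream_priority", "vehicle_id", "seq_no")
```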
Non-functional requirements
- Exactly-once or effectively-once semantics for manifest generation (no double-counting frames).
- High availability: tolerate zone loss without losing committed events.
- Cost controls: keep incremental platform cost under $150K/month (compute + storage + warehouse).
- Security/compliance: access controls by team; immutable audit logs for safety investigations.
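One common pattern for the effectively-once requirement above (a sketch, not the required approach) is to derive the manifest identity deterministically from its exact inputs, so retries and re-runs upsert the same row instead of double-counting frames.

```python
import hashlib
import json

def manifest_fingerprint(input_slices, extraction_code_version, rule_version):
    """Deterministic ID: same inputs + same code/rules => same manifest row.

    `input_slices` is a list of (topic, partition, start_offset, end_offset) tuples;
    all argument names here are illustrative, not a prescribed schema.
    """
    payload = json.dumps(
        {
            "slices": sorted(input_slices),
            "code": extraction_code_version,
            "rules": rule_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# A writer can then MERGE on this ID in BigQuery (or overwrite a keyed object in GCS),
# making manifest generation idempotent under retries and backfills.
```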
Constraints
- Cloud: GCP (GCS, BigQuery) already standardized.
- Existing streaming: Kafka is used in some teams; platform is open to Confluent Cloud or self-managed.
- Processing: team has strong Spark expertise; limited Flink experience.
- Orchestration: Airflow 2.x is the standard scheduler.
- Data modeling: analytics uses dbt in BigQuery.
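Consistent with these constraints, an orchestration/backfill skeleton in Airflow (2.4+ for the `schedule` argument) might look like the sketch below; the DAG ID, concurrency cap, and the job-submission body are placeholders for the real Spark extraction job.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_scenarios(data_interval_start=None, data_interval_end=None, **_):
    """Placeholder: submit the Spark extraction job for one historical window.

    In practice this would launch a Dataproc/Spark job reading lake files for
    [data_interval_start, data_interval_end) at a capped read rate so backfill
    never competes with live ingestion.
    """
    print(f"Re-extracting scenarios for {data_interval_start} -> {data_interval_end}")

with DAG(
    dag_id="scenario_manifest_backfill",           # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                                  # enables backfill-style reprocessing of past windows
    max_active_runs=4,                             # assumption: throttle concurrency to protect prod ingestion
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    PythonOperator(
        task_id="extract_scenarios",
        python_callable=extract_scenarios,
    )
```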
What we’re evaluating
In your design, be explicit about:
- Topic design and partitioning strategy (keys, ordering guarantees, retention)
- Watermarking strategy for late data and how it affects determinism
- Manifest data model (what tables/files exist, partitioning, and versioning)
- How replay reads are served (from Kafka offsets vs lake files) and how you guarantee consistent ordering across streams
- Backfill strategy that doesn’t overload production ingestion
- Monitoring/alerting and operational playbooks for common failures
Deliverables: architecture diagram, data flow steps, tech stack choices, example code for streaming processing + orchestration/backfill, and production monitoring/failure handling.