Context
You’re interviewing with the Autonomy Platform team at a Waymo-like autonomous driving company that runs large-scale simulation to validate perception, prediction, and planning models before on-road testing. The simulation fleet replays real-world traffic scenarios (e.g., unprotected left turns, emergency vehicle interactions, pedestrian jaywalking) to reproduce bugs and measure regression risk.
Today, the company logs vehicle telemetry and perception outputs from 15,000 test vehicles across Phoenix, SF, and Austin. Data lands in a data lake (GCS) and is later curated into “scenarios” by offline analysts. The current workflow is slow and non-deterministic: scenario extraction is batch-only (24–48 hour lag), and replay results can differ run-to-run due to inconsistent event ordering, schema drift, and missing late-arriving sensor packets. This has caused safety-critical regressions to slip into weekly model releases and has created compliance risk because scenario provenance is not consistently auditable.
The VP of Engineering wants a simulation replay engine that allows developers to specify a scenario (by time range, geo-fence, vehicle IDs, or a known incident ID), and then replay it deterministically into a simulator cluster. The system must support both real-time “nearline” scenario availability for rapid debugging and large-scale backfills for offline evaluation.
Scale Requirements
- Ingestion throughput: peak 2,000,000 events/sec across all vehicles (bursty during dense urban driving)
- Event size: avg 1.5 KB (protobuf), with occasional 50–200 KB blobs (compressed point clouds referenced by URI)
- Daily volume: ~20 TB/day raw telemetry + 200 TB/day referenced sensor blobs (stored separately)
- Scenario query latency: developers can locate and start replay of an indexed scenario in < 30 seconds
- Replay control latency: pause/seek/speed changes applied in < 2 seconds
- Determinism: identical inputs must yield byte-identical ordered event streams for a given scenario version
- Retention: raw events 180 days, curated scenario manifests 2 years
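A rough back-of-envelope sketch in Python makes these numbers concrete; the per-partition throughput figure is an assumption for illustration, not a measured value.

```python
# Back-of-envelope capacity math from the figures above.
PEAK_EVENTS_PER_SEC = 2_000_000
AVG_EVENT_BYTES = 1_500            # ~1.5 KB protobuf payload
RAW_TB_PER_DAY = 20
RAW_RETENTION_DAYS = 180

peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES           # ~3 GB/s at peak
assumed_partition_mb_per_sec = 10                                     # assumption: conservative per-partition write rate
min_partitions = peak_bytes_per_sec / (assumed_partition_mb_per_sec * 1_000_000)

raw_retention_tb = RAW_TB_PER_DAY * RAW_RETENTION_DAYS                # ~3,600 TB of raw telemetry at 180 days

print(f"Peak ingest: {peak_bytes_per_sec / 1e9:.1f} GB/s")
print(f"Minimum partitions at {assumed_partition_mb_per_sec} MB/s each: {min_partitions:.0f}")
print(f"Raw telemetry footprint at full retention: {raw_retention_tb:,.0f} TB")
```

The ~3,600 TB figure at full 180-day retention hints that most of that retention will live in object storage rather than in the streaming log itself.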
Data Characteristics
Event types (examples)
| Stream | Key fields | Notes |
|---|---|---|
| vehicle_pose | vehicle_id, event_time, seq_no, x, y, z, heading | High frequency (50–100 Hz) |
| perception_objects | vehicle_id, event_time, frame_id, objects[] | Medium frequency (10–20 Hz), schema evolves |
| control_commands | vehicle_id, event_time, seq_no, steer, brake, throttle | Must be strictly ordered |
| incident_markers | incident_id, vehicle_id, event_time, type, metadata | Low frequency, used for indexing |
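To make the stream shapes concrete, here is a minimal sketch of these records as Python dataclasses; only the field names in the table come from the prompt, and all types and units are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class VehiclePose:
    vehicle_id: str
    event_time: float          # assumed: epoch seconds; actual unit depends on the protobuf schema
    seq_no: int
    x: float
    y: float
    z: float
    heading: float

@dataclass
class PerceptionObjects:
    vehicle_id: str
    event_time: float
    frame_id: int
    objects: List[Dict[str, Any]]   # schema evolves monthly; keep permissive here

@dataclass
class ControlCommand:
    vehicle_id: str
    event_time: float
    seq_no: int                     # strict per-vehicle ordering key
    steer: float
    brake: float
    throttle: float

@dataclass
class IncidentMarker:
    incident_id: str
    vehicle_id: str
    event_time: float
    type: str
    metadata: Dict[str, Any] = field(default_factory=dict)
```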
Quality issues you must handle
- Late-arriving packets: up to 15 minutes late due to connectivity gaps
- Duplicates: retries from edge gateways cause duplicate seq numbers
- Out-of-order events: cross-topic ordering is not guaranteed
- Schema evolution: perception payload adds/removes fields monthly
- Partial data: some sensors missing for a window; replay must either fail fast or degrade explicitly
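As an illustration of how the lateness and duplicate issues above might be absorbed in a Spark Structured Streaming job (leaning on the team's Spark expertise), here is a minimal sketch. It assumes Spark 3.5+ for dropDuplicatesWithinWatermark; the topic name, JSON parsing (real payloads are protobuf), and GCS paths are placeholders.

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("telemetry-normalize").getOrCreate()

# Placeholder schema; real payloads are protobuf and would be decoded with a descriptor set.
pose_schema = T.StructType([
    T.StructField("vehicle_id", T.StringType()),
    T.StructField("event_time", T.TimestampType()),
    T.StructField("seq_no", T.LongType()),
    T.StructField("x", T.DoubleType()),
    T.StructField("y", T.DoubleType()),
    T.StructField("z", T.DoubleType()),
    T.StructField("heading", T.DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "BROKERS")       # placeholder
    .option("subscribe", "telemetry.vehicle_pose")      # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), pose_schema).alias("e"),
               F.col("partition"), F.col("offset"))
    .select("e.*", "partition", "offset")
    # Bound dedup state by the stated 15-minute lateness; events later than this may not be
    # deduplicated on this path and must be reconciled by the batch/backfill layer.
    .withWatermark("event_time", "15 minutes")
    # Drop gateway retries that reuse the same per-vehicle sequence number.
    .dropDuplicatesWithinWatermark(["vehicle_id", "seq_no"])
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "gs://bucket/normalized/vehicle_pose/")         # placeholder
    .option("checkpointLocation", "gs://bucket/checkpoints/pose/")  # placeholder
    .trigger(processingTime="1 minute")
    .start()
)
```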
Your Task
Design the end-to-end data pipeline and orchestration to support scenario extraction, indexing, and deterministic replay.
Functional requirements
- Ingest telemetry streams from vehicles into a durable log with replay capability.
- Normalize and validate events (schema validation, required fields, unit normalization).
- Deduplicate and order events deterministically per scenario (define ordering rules).
- Build scenario manifests: a manifest describes scenario boundaries, required streams, blob URIs, and a deterministic ordering/offset mapping.
- Support developer APIs:
  - Find scenarios by incident_id, time range, geo-fence, tags (e.g., “unprotected_left”), and vehicle IDs.
  - Start and control replay: normal speed, 2x/0.5x, pause/resume, and seek to a timestamp.
- Backfill: regenerate manifests for historical data and re-run scenario extraction logic when rules change.
- Auditability: every replay run must be traceable to exact input offsets, schema versions, and extraction code version.
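As one possible (not prescribed) shape for the manifest and auditability requirements above, a scenario manifest could be a small versioned record like the following; all field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class StreamSlice:
    """Exact input range for one topic-partition feeding a scenario (audit trail)."""
    topic: str
    partition: int
    start_offset: int   # inclusive
    end_offset: int     # exclusive

@dataclass
class ScenarioManifest:
    scenario_id: str
    manifest_version: int                    # bumped when inputs or extraction rules change
    incident_id: Optional[str]
    vehicle_ids: List[str]
    start_time_ns: int
    end_time_ns: int
    required_streams: List[str]              # e.g. ["vehicle_pose", "control_commands"]
    blob_uris: List[str]                     # GCS URIs of referenced point clouds
    input_slices: List[StreamSlice]          # exact offsets consumed, per topic-partition
    schema_versions: Dict[str, str]          # stream -> schema version used to decode
    extraction_code_version: str             # e.g. git SHA of the extraction job
    # One possible deterministic ordering rule: sort all events by this tuple of fields.
    ordering_key: Tuple[str, ...] = ("event_time", "stream_priority", "vehicle_id", "seq_no")
```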
Non-functional requirements
- Exactly-once or effectively-once semantics for manifest generation (no double-counting frames).
- High availability: tolerate zone loss without losing committed events.
- Cost controls: keep incremental platform cost under $150K/month (compute + storage + warehouse).
- Security/compliance: access controls by team; immutable audit logs for safety investigations.
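One common pattern for the effectively-once requirement above (a sketch, not the required approach) is to derive the manifest identity deterministically from its exact inputs, so retries and re-runs upsert the same row instead of double-counting frames.

```python
import hashlib
import json

def manifest_fingerprint(input_slices, extraction_code_version, rule_version):
    """Deterministic ID: same inputs + same code/rules => same manifest row.

    `input_slices` is a list of (topic, partition, start_offset, end_offset) tuples;
    all argument names here are illustrative, not a prescribed schema.
    """
    payload = json.dumps(
        {
            "slices": sorted(input_slices),
            "code": extraction_code_version,
            "rules": rule_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# A writer can then MERGE on this ID in BigQuery (or overwrite a keyed object in GCS),
# making manifest generation idempotent under retries and backfills.
```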
Constraints
- Cloud: GCP (GCS, BigQuery) already standardized.
- Existing streaming: Kafka is used in some teams; platform is open to Confluent Cloud or self-managed.
- Processing: team has strong Spark expertise; limited Flink experience.
- Orchestration: Airflow 2.x is the standard scheduler.
- Data modeling: analytics uses dbt in BigQuery.
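Consistent with these constraints, an orchestration/backfill skeleton in Airflow (2.4+ for the `schedule` argument) might look like the sketch below; the DAG ID, concurrency cap, and the job-submission body are placeholders for the real Spark extraction job.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_scenarios(data_interval_start=None, data_interval_end=None, **_):
    """Placeholder: submit the Spark extraction job for one historical window.

    In practice this would launch a Dataproc/Spark job reading lake files for
    [data_interval_start, data_interval_end) at a capped read rate so backfill
    never competes with live ingestion.
    """
    print(f"Re-extracting scenarios for {data_interval_start} -> {data_interval_end}")

with DAG(
    dag_id="scenario_manifest_backfill",           # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                                  # enables backfill-style reprocessing of past windows
    max_active_runs=4,                             # assumption: throttle concurrency to protect prod ingestion
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    PythonOperator(
        task_id="extract_scenarios",
        python_callable=extract_scenarios,
    )
```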
What we’re evaluating
In your design, be explicit about:
- Topic design and partitioning strategy (keys, ordering guarantees, retention)
- Watermarking strategy for late data and how it affects determinism
- Manifest data model (what tables/files exist, partitioning, and versioning)
- How replay reads are served (from Kafka offsets vs lake files) and how you guarantee consistent ordering across streams
- Backfill strategy that doesn’t overload production ingestion
- Monitoring/alerting and operational playbooks for common failures
Deliverables: architecture diagram, data flow steps, tech stack choices, example code for streaming processing + orchestration/backfill, and production monitoring/failure handling.