Context
AuroraFleet is a logistics company operating 120,000 connected delivery vehicles across North America and the EU. Each vehicle streams telemetry from multiple sensors used for driver safety and predictive maintenance: GPS, IMU (accelerometer/gyroscope), wheel speed, engine CAN bus, and a forward-facing camera module that emits object detections (not raw video). Today, each sensor is ingested independently into an S3 data lake and processed in hourly Spark batch jobs. This architecture creates a 30–90 minute freshness lag, and the safety team cannot reliably detect combined harsh-braking and obstacle-proximity events in time to trigger coaching workflows or to validate incident claims.
You are asked to design a real-time sensor fusion pipeline that combines inputs from multiple sensors into a single “fused” event stream per vehicle, aligned on time, robust to out-of-order and late-arriving data, and queryable for both real-time alerting and analytics. The fused stream will feed (1) a near-real-time dashboard used by the Safety Operations Center, and (2) a Snowflake warehouse used by analysts and ML feature pipelines.
Scale Requirements
- Fleet size: 120k vehicles active daily; peak concurrent vehicles ~70k
- Ingestion throughput (peak):
  - GPS: ~1 Hz per vehicle → up to 70k events/sec
  - IMU: 50 Hz per vehicle (batched on device into 1-second packets) → up to 70k packets/sec (each packet contains 50 samples)
  - Wheel speed: 10 Hz per vehicle → up to 700k events/sec
  - CAN bus summary: 1 Hz per vehicle → up to 70k events/sec
  - Camera detections: bursty, avg 0.2 Hz, peak 5 Hz per vehicle → up to 350k events/sec during dense traffic
- End-to-end latency: P95 < 2 seconds from event time to fused output availability for alerting
- Late/out-of-order: up to 5 minutes late due to cellular dead zones; out-of-order within a vehicle is common
- Retention:
  - Raw sensor data: 30 days hot, 1 year cold
  - Fused feature tables: 2 years
- Availability: 99.9% for streaming path; no single-AZ dependency
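The peak-rate figures above are simple products of per-sensor rate and peak concurrent vehicles; a quick arithmetic sanity check (assuming all 70k peak concurrent vehicles emit at once, which is the worst case):

```python
# Sanity check of the peak ingestion rates:
# peak events/sec = per-vehicle rate (Hz) x peak concurrent vehicles.
PEAK_VEHICLES = 70_000

rates_hz = {
    "gps": 1.0,            # 1 Hz per vehicle
    "imu_packets": 1.0,    # 50 Hz samples batched into 1-second packets
    "wheel_speed": 10.0,   # 10 Hz per vehicle
    "can_summary": 1.0,    # 1 Hz per vehicle
    "camera_peak": 5.0,    # bursty; 5 Hz at peak
}

peak_eps = {name: int(hz * PEAK_VEHICLES) for name, hz in rates_hz.items()}
total_eps = sum(peak_eps.values())

print(peak_eps["wheel_speed"])  # 700000
print(total_eps)                # 1260000 messages/sec if all peaks overlap
```

The ~1.26M msg/sec worst-case total is a useful input when sizing Kafka partition counts and stream-processor parallelism.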
Data Characteristics
Example input schemas (simplified)
| Sensor | Key fields | Notes |
|---|---|---|
| GPS | vehicle_id, event_time, lat, lon, speed_mps, heading | event_time from device clock; may drift |
| IMU packet | vehicle_id, packet_start_time, samples[ {t_offset_ms, ax, ay, az, gx, gy, gz} ] | 50 samples/sec; packet may arrive late |
| Wheel speed | vehicle_id, event_time, wheel_speed_mps | higher frequency; occasional duplicates |
| CAN summary | vehicle_id, event_time, rpm, throttle_pct, brake_pressed | sparse but critical |
| Camera detections | vehicle_id, event_time, objects[ {type, distance_m, confidence} ] | no images; privacy-friendly |
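For time alignment, the batched IMU packets must be exploded into individually timestamped samples using `packet_start_time` plus each sample's `t_offset_ms`. A minimal sketch of that normalization step (the `_ms` suffix on the start-time field and the `CanonicalSample` type are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class CanonicalSample:
    """Illustrative canonical form for one IMU sample."""
    vehicle_id: str
    event_time_ms: int  # absolute epoch milliseconds
    ax: float
    ay: float
    az: float

def explode_imu_packet(packet: dict) -> list[CanonicalSample]:
    """Turn one 1-second IMU packet (50 samples at 50 Hz) into
    per-sample canonical events: event_time = packet start + offset."""
    start_ms = packet["packet_start_time_ms"]
    return [
        CanonicalSample(
            vehicle_id=packet["vehicle_id"],
            event_time_ms=start_ms + s["t_offset_ms"],
            ax=s["ax"], ay=s["ay"], az=s["az"],
        )
        for s in packet["samples"]
    ]

pkt = {
    "vehicle_id": "veh-42",
    "packet_start_time_ms": 1_700_000_000_000,
    # 50 Hz -> one sample every 20 ms
    "samples": [{"t_offset_ms": i * 20, "ax": 0.0, "ay": 0.0, "az": 9.81}
                for i in range(50)],
}
samples = explode_imu_packet(pkt)
print(len(samples), samples[-1].event_time_ms)  # 50 1700000000980
```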
Quality issues you must handle
- Clock skew: device clocks can drift ±2 seconds; NTP sync is best-effort
- Duplicates: retries from edge gateway can cause duplicate messages (same event_id)
- Partial data: camera module may be offline; IMU packets can be dropped
- Schema evolution: new object types and CAN fields added quarterly
- Backfills: after connectivity returns, vehicles can upload a backlog of 1–5 minutes
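Duplicates from gateway retries can be dropped with a bounded per-partition cache keyed on `event_id`; a minimal in-memory sketch, assuming a TTL matched to the 5-minute lateness/backfill bound (a production job would keep this state in the stream processor's keyed state store):

```python
from collections import OrderedDict

class Deduplicator:
    """Effectively-once filter: remembers recently seen event_ids for
    `ttl_s` seconds and drops repeats. Because state is per partition,
    all messages for a given vehicle must land on the same partition."""
    def __init__(self, ttl_s: int = 300):
        self.ttl_s = ttl_s
        self._seen: OrderedDict[str, float] = OrderedDict()  # event_id -> first-seen ts

    def is_new(self, event_id: str, now_s: float) -> bool:
        # Evict entries older than the TTL; insertion order == time order
        # since ids are only inserted once.
        while self._seen and next(iter(self._seen.values())) < now_s - self.ttl_s:
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return False
        self._seen[event_id] = now_s
        return True

d = Deduplicator(ttl_s=300)
print(d.is_new("e1", now_s=0.0))    # True  (first delivery)
print(d.is_new("e1", now_s=10.0))   # False (gateway retry, dropped)
print(d.is_new("e1", now_s=400.0))  # True  (original aged out of cache)
```

The TTL keeps memory bounded at roughly (events/sec per partition × 300 s) entries, at the cost of readmitting a duplicate that arrives after the window, which downstream idempotent upserts can still absorb.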
Requirements
Functional requirements
- Ingest all sensor streams in real time with durable buffering.
- Normalize sensor payloads into a common canonical format (types, units, metadata).
- Time-align and fuse signals per vehicle into a unified timeline at a configurable resolution (e.g., 100 ms or 1 s), producing fused records such as:
  - fused_time, vehicle_id
  - location (best estimate)
  - kinematics (speed, acceleration)
  - braking/harsh-event flags
  - nearest obstacle distance (from camera detections)
- Support late-arriving data and retractions/updates to fused outputs within a 5-minute window.
- Produce outputs for:
  - real-time alerting (e.g., harsh brake + obstacle < 8 m)
  - analytics/warehouse (Snowflake tables for investigations and reporting)
- Provide a reprocessing/backfill mechanism for a given vehicle_id and time range.
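The core of the fusion requirement is grouping each vehicle's canonical events into fixed windows and reducing each bucket to one fused record. A simplified, in-memory sketch of that alignment step at 1-second resolution (a real job would run in a stream processor with keyed state and watermarks; the `sensor` tag field is an illustrative assumption):

```python
from collections import defaultdict

WINDOW_MS = 1_000  # configurable resolution; 1 s here

def fuse(events: list[dict]) -> dict[tuple[str, int], dict]:
    """Group canonical events by (vehicle_id, window start) and reduce
    each bucket to one fused record using fields from the schemas above."""
    buckets: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for e in events:
        win = e["event_time_ms"] // WINDOW_MS * WINDOW_MS
        buckets[(e["vehicle_id"], win)].append(e)

    fused = {}
    for (vid, win), evs in buckets.items():
        gps = [e for e in evs if e["sensor"] == "gps"]
        cam = [e for e in evs if e["sensor"] == "camera"]
        can = [e for e in evs if e["sensor"] == "can"]
        fused[(vid, win)] = {
            "fused_time_ms": win,
            "vehicle_id": vid,
            # best-estimate location: latest GPS fix in the window, if any
            "location": max(gps, key=lambda e: e["event_time_ms"], default=None),
            # nearest obstacle across all camera detections in the window
            "nearest_obstacle_m": min(
                (o["distance_m"] for e in cam for o in e["objects"]), default=None),
            "brake_pressed": any(e.get("brake_pressed") for e in can),
        }
    return fused

events = [
    {"sensor": "gps", "vehicle_id": "v1", "event_time_ms": 1_000, "lat": 1.0, "lon": 2.0},
    {"sensor": "camera", "vehicle_id": "v1", "event_time_ms": 1_400,
     "objects": [{"type": "car", "distance_m": 6.5, "confidence": 0.9}]},
    {"sensor": "can", "vehicle_id": "v1", "event_time_ms": 1_900, "brake_pressed": True},
]
rec = fuse(events)[("v1", 1_000)]
print(rec["nearest_obstacle_m"], rec["brake_pressed"])  # 6.5 True
```

Note how the alerting condition from the requirements falls out directly: `rec["brake_pressed"] and rec["nearest_obstacle_m"] < 8` fires for this window.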
Non-functional requirements
- Exactly-once or effectively-once fused outputs (no double-counting in downstream tables).
- Strong observability: end-to-end latency, lag, data quality, and per-sensor drop rates.
- Security & compliance: EU data residency for EU vehicles; encryption in transit and at rest.
- Cost-aware design: target incremental run cost < $120k/month.
Constraints
- Cloud: AWS primary; Snowflake already used for analytics.
- Team: 6 data engineers; strong Spark + Airflow; moderate Kafka; minimal Flink.
- Existing ingestion: vehicles publish to an edge gateway that can produce to Kafka.
- Must support multi-region (us-east-1, eu-central-1) with separate storage for EU.
What we’re evaluating
- Your end-to-end architecture choices (Kafka topics, partitioning, streaming compute, storage).
- How you implement time alignment, watermarking, and update semantics.
- Data modeling for raw vs canonical vs fused layers.
- Orchestration for backfills and schema evolution.
- Monitoring, failure recovery, and correctness under duplicates and late data.
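For the watermarking point above, a common pattern is a bounded-out-of-orderness watermark: track the maximum event time observed and subtract the allowed lateness (5 minutes here, matching the stated bound); a window is finalized only once the watermark passes its end, and older events take a late/update path. A minimal sketch of the mechanism, not tied to any particular engine:

```python
ALLOWED_LATENESS_MS = 5 * 60 * 1_000  # 5-minute lateness bound from the requirements

class BoundedWatermark:
    """watermark = max observed event time - allowed lateness.
    A window [start, end) may be finalized once watermark >= end;
    events with event_time < watermark are routed to a late/update path."""
    def __init__(self, lateness_ms: int = ALLOWED_LATENESS_MS):
        self.lateness_ms = lateness_ms
        self.max_event_time_ms = 0

    def observe(self, event_time_ms: int) -> None:
        self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)

    @property
    def watermark_ms(self) -> int:
        return self.max_event_time_ms - self.lateness_ms

    def window_closed(self, window_end_ms: int) -> bool:
        return self.watermark_ms >= window_end_ms

wm = BoundedWatermark()
wm.observe(600_000)               # 10 minutes into the stream
print(wm.watermark_ms)            # 300000
print(wm.window_closed(300_000))  # True: window ending at 5 min is final
print(wm.window_closed(301_000))  # False: still inside the lateness horizon
```

This is the mechanism behind emitting provisional fused records quickly for alerting while allowing retractions/updates within the 5-minute window before results are sealed for the warehouse.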
Deliverables in your answer:
- A component-level architecture and data flow.
- Topic/table design and partitioning strategy.
- How you compute fused records (windowing strategy, state, watermark).
- How you load to Snowflake and keep it consistent with updates.
- Monitoring + alerting plan and failure handling.