Context
AuroraFleet is a logistics company operating 120,000 connected delivery vehicles across North America and the EU. Each vehicle streams telemetry from multiple sensors used for driver safety and predictive maintenance: GPS, IMU (accelerometer/gyroscope), wheel speed, engine CAN bus, and a forward-facing camera module that emits object detections (not raw video). Today, each sensor is ingested independently into an S3 data lake and processed in hourly Spark batch jobs. This architecture creates a 30–90 minute freshness lag, and the safety team cannot reliably detect combined harsh-braking and obstacle-proximity events in time to trigger coaching workflows or to validate incident claims.
You are asked to design a real-time sensor fusion pipeline that combines inputs from multiple sensors into a single “fused” event stream per vehicle, aligned on time, robust to out-of-order and late-arriving data, and queryable for both real-time alerting and analytics. The fused stream will feed (1) a near-real-time dashboard used by the Safety Operations Center, and (2) a Snowflake warehouse used by analysts and ML feature pipelines.
Scale Requirements
- Fleet size: 120k vehicles active daily; peak concurrent vehicles ~70k
- Ingestion throughput (peak):
  - GPS: ~1 Hz per vehicle → up to 70k events/sec
  - IMU: 50 Hz per vehicle (batched on device into 1-second packets) → up to 70k packets/sec (each packet contains 50 samples)
  - Wheel speed: 10 Hz per vehicle → up to 700k events/sec
  - CAN bus summary: 1 Hz per vehicle → up to 70k events/sec
  - Camera detections: bursty, avg 0.2 Hz, peak 5 Hz per vehicle → up to 350k events/sec during dense traffic
- End-to-end latency: P95 < 2 seconds from event time to fused output availability for alerting
- Late/out-of-order: up to 5 minutes late due to cellular dead zones; out-of-order within a vehicle is common
- Retention:
  - Raw sensor data: 30 days hot, 1 year cold
  - Fused feature tables: 2 years
- Availability: 99.9% for streaming path; no single-AZ dependency
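The peak-rate figures above are simple products of per-sensor rate and peak concurrent vehicles; a quick arithmetic sanity check (assuming all 70k peak concurrent vehicles emit at once, which is the worst case):

```python
# Sanity check of the peak ingestion rates:
# peak events/sec = per-vehicle rate (Hz) x peak concurrent vehicles.
PEAK_VEHICLES = 70_000

rates_hz = {
    "gps": 1.0,            # 1 Hz per vehicle
    "imu_packets": 1.0,    # 50 Hz samples batched into 1-second packets
    "wheel_speed": 10.0,   # 10 Hz per vehicle
    "can_summary": 1.0,    # 1 Hz per vehicle
    "camera_peak": 5.0,    # bursty; 5 Hz at peak
}

peak_eps = {name: int(hz * PEAK_VEHICLES) for name, hz in rates_hz.items()}
total_eps = sum(peak_eps.values())

print(peak_eps["wheel_speed"])  # 700000
print(total_eps)                # 1260000 messages/sec if all peaks overlap
```

The ~1.26M msg/sec worst-case total is a useful input when sizing Kafka partition counts and stream-processor parallelism.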
Data Characteristics
Example input schemas (simplified)
| Sensor | Key fields | Notes |
|---|---|---|
| GPS | vehicle_id, event_time, lat, lon, speed_mps, heading | event_time from device clock; may drift |
| IMU packet | vehicle_id, packet_start_time, samples[ {t_offset_ms, ax, ay, az, gx, gy, gz} ] | 50 samples/sec; packet may arrive late |
| Wheel speed | vehicle_id, event_time, wheel_speed_mps | higher frequency; occasional duplicates |
| CAN summary | vehicle_id, event_time, rpm, throttle_pct, brake_pressed | sparse but critical |
| Camera detections | vehicle_id, event_time, objects[ {type, distance_m, confidence} ] | no images; privacy-friendly |
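For time alignment, the batched IMU packets must be exploded into individually timestamped samples using `packet_start_time` plus each sample's `t_offset_ms`. A minimal sketch of that normalization step (the `_ms` suffix on the start-time field and the `CanonicalSample` type are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class CanonicalSample:
    """Illustrative canonical form for one IMU sample."""
    vehicle_id: str
    event_time_ms: int  # absolute epoch milliseconds
    ax: float
    ay: float
    az: float

def explode_imu_packet(packet: dict) -> list[CanonicalSample]:
    """Turn one 1-second IMU packet (50 samples at 50 Hz) into
    per-sample canonical events: event_time = packet start + offset."""
    start_ms = packet["packet_start_time_ms"]
    return [
        CanonicalSample(
            vehicle_id=packet["vehicle_id"],
            event_time_ms=start_ms + s["t_offset_ms"],
            ax=s["ax"], ay=s["ay"], az=s["az"],
        )
        for s in packet["samples"]
    ]

pkt = {
    "vehicle_id": "veh-42",
    "packet_start_time_ms": 1_700_000_000_000,
    # 50 Hz -> one sample every 20 ms
    "samples": [{"t_offset_ms": i * 20, "ax": 0.0, "ay": 0.0, "az": 9.81}
                for i in range(50)],
}
samples = explode_imu_packet(pkt)
print(len(samples), samples[-1].event_time_ms)  # 50 1700000000980
```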
Quality issues you must handle
- Clock skew: device clocks can drift ±2 seconds; NTP sync is best-effort
- Duplicates: retries from edge gateway can cause duplicate messages (same event_id)
- Partial data: camera module may be offline; IMU packets can be dropped
- Schema evolution: new object types and CAN fields added quarterly
- Backfills: after connectivity returns, vehicles can upload a backlog of 1–5 minutes
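Duplicates from gateway retries can be dropped with a bounded per-partition cache keyed on `event_id`; a minimal in-memory sketch, assuming a TTL matched to the 5-minute lateness/backfill bound (a production job would keep this state in the stream processor's keyed state store):

```python
from collections import OrderedDict

class Deduplicator:
    """Effectively-once filter: remembers recently seen event_ids for
    `ttl_s` seconds and drops repeats. Because state is per partition,
    all messages for a given vehicle must land on the same partition."""
    def __init__(self, ttl_s: int = 300):
        self.ttl_s = ttl_s
        self._seen: OrderedDict[str, float] = OrderedDict()  # event_id -> first-seen ts

    def is_new(self, event_id: str, now_s: float) -> bool:
        # Evict entries older than the TTL; insertion order == time order
        # since ids are only inserted once.
        while self._seen and next(iter(self._seen.values())) < now_s - self.ttl_s:
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return False
        self._seen[event_id] = now_s
        return True

d = Deduplicator(ttl_s=300)
print(d.is_new("e1", now_s=0.0))    # True  (first delivery)
print(d.is_new("e1", now_s=10.0))   # False (gateway retry, dropped)
print(d.is_new("e1", now_s=400.0))  # True  (original aged out of cache)
```

The TTL keeps memory bounded at roughly (events/sec per partition × 300 s) entries, at the cost of readmitting a duplicate that arrives after the window, which downstream idempotent upserts can still absorb.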
Requirements
Functional requirements
- Ingest all sensor streams in real time with durable buffering.
- Normalize sensor payloads into a common canonical format (types, units, metadata).
- Time-align and fuse signals per vehicle into a unified timeline at a configurable resolution (e.g., 100 ms or 1 s), producing fused records such as:
  - fused_time, vehicle_id
  - location (best estimate)
  - kinematics (speed, acceleration)
  - braking/harsh-event flags
  - nearest obstacle distance (from camera detections)
- Support late-arriving data and retractions/updates to fused outputs within a 5-minute window.
- Produce outputs for:
  - real-time alerting (e.g., harsh brake + obstacle < 8 m)
  - analytics/warehouse (Snowflake tables for investigations and reporting)
- Provide a reprocessing/backfill mechanism for a given vehicle_id and time range.
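The core of the fusion requirement is grouping each vehicle's canonical events into fixed windows and reducing each bucket to one fused record. A simplified, in-memory sketch of that alignment step at 1-second resolution (a real job would run in a stream processor with keyed state and watermarks; the `sensor` tag field is an illustrative assumption):

```python
from collections import defaultdict

WINDOW_MS = 1_000  # configurable resolution; 1 s here

def fuse(events: list[dict]) -> dict[tuple[str, int], dict]:
    """Group canonical events by (vehicle_id, window start) and reduce
    each bucket to one fused record using fields from the schemas above."""
    buckets: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for e in events:
        win = e["event_time_ms"] // WINDOW_MS * WINDOW_MS
        buckets[(e["vehicle_id"], win)].append(e)

    fused = {}
    for (vid, win), evs in buckets.items():
        gps = [e for e in evs if e["sensor"] == "gps"]
        cam = [e for e in evs if e["sensor"] == "camera"]
        can = [e for e in evs if e["sensor"] == "can"]
        fused[(vid, win)] = {
            "fused_time_ms": win,
            "vehicle_id": vid,
            # best-estimate location: latest GPS fix in the window, if any
            "location": max(gps, key=lambda e: e["event_time_ms"], default=None),
            # nearest obstacle across all camera detections in the window
            "nearest_obstacle_m": min(
                (o["distance_m"] for e in cam for o in e["objects"]), default=None),
            "brake_pressed": any(e.get("brake_pressed") for e in can),
        }
    return fused

events = [
    {"sensor": "gps", "vehicle_id": "v1", "event_time_ms": 1_000, "lat": 1.0, "lon": 2.0},
    {"sensor": "camera", "vehicle_id": "v1", "event_time_ms": 1_400,
     "objects": [{"type": "car", "distance_m": 6.5, "confidence": 0.9}]},
    {"sensor": "can", "vehicle_id": "v1", "event_time_ms": 1_900, "brake_pressed": True},
]
rec = fuse(events)[("v1", 1_000)]
print(rec["nearest_obstacle_m"], rec["brake_pressed"])  # 6.5 True
```

Note how the alerting condition from the requirements falls out directly: `rec["brake_pressed"] and rec["nearest_obstacle_m"] < 8` fires for this window.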
Non-functional requirements
- Exactly-once or effectively-once fused outputs (no double-counting in downstream tables).
- Strong observability: end-to-end latency, lag, data quality, and per-sensor drop rates.
- Security & compliance: EU data residency for EU vehicles; encryption in transit and at rest.
- Cost-aware design: target incremental run cost < $120k/month.
Constraints
- Cloud: AWS primary; Snowflake already used for analytics.
- Team: 6 data engineers; strong Spark + Airflow; moderate Kafka; minimal Flink.
- Existing ingestion: vehicles publish to an edge gateway that can produce to Kafka.
- Must support multi-region (us-east-1, eu-central-1) with separate storage for EU.
What we’re evaluating
- Your end-to-end architecture choices (Kafka topics, partitioning, streaming compute, storage).
- How you implement time alignment, watermarking, and update semantics.
- Data modeling for raw vs canonical vs fused layers.
- Orchestration for backfills and schema evolution.
- Monitoring, failure recovery, and correctness under duplicates and late data.
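For the watermarking point above, a common pattern is a bounded-out-of-orderness watermark: track the maximum event time observed and subtract the allowed lateness (5 minutes here, matching the stated bound); a window is finalized only once the watermark passes its end, and older events take a late/update path. A minimal sketch of the mechanism, not tied to any particular engine:

```python
ALLOWED_LATENESS_MS = 5 * 60 * 1_000  # 5-minute lateness bound from the requirements

class BoundedWatermark:
    """watermark = max observed event time - allowed lateness.
    A window [start, end) may be finalized once watermark >= end;
    events with event_time < watermark are routed to a late/update path."""
    def __init__(self, lateness_ms: int = ALLOWED_LATENESS_MS):
        self.lateness_ms = lateness_ms
        self.max_event_time_ms = 0

    def observe(self, event_time_ms: int) -> None:
        self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)

    @property
    def watermark_ms(self) -> int:
        return self.max_event_time_ms - self.lateness_ms

    def window_closed(self, window_end_ms: int) -> bool:
        return self.watermark_ms >= window_end_ms

wm = BoundedWatermark()
wm.observe(600_000)               # 10 minutes into the stream
print(wm.watermark_ms)            # 300000
print(wm.window_closed(300_000))  # True: window ending at 5 min is final
print(wm.window_closed(301_000))  # False: still inside the lateness horizon
```

This is the mechanism behind emitting provisional fused records quickly for alerting while allowing retractions/updates within the 5-minute window before results are sealed for the warehouse.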
Deliverables in your answer:
- A component-level architecture and data flow.
- Topic/table design and partitioning strategy.
- How you compute fused records (windowing strategy, state, watermark).
- How you load to Snowflake and keep it consistent with updates.
- Monitoring + alerting plan and failure handling.