Context
RoboFleet operates 120,000 autonomous delivery agents that continuously emit telemetry, including GPS, battery state, sensor health, route decisions, and safety events. The current system uploads compressed log bundles to object storage every 30 minutes and processes them with hourly Spark batch jobs, so an event can be roughly 90 minutes old before processing even starts. That is too slow for fleet operations, incident response, and model diagnostics.
You need to design a scalable telemetry ingestion platform that supports both real-time operational monitoring and downstream analytical workloads without losing data during network instability or regional outages.
Scale Requirements
- Fleet size: 120,000 active agents globally
- Ingress rate: 1.2M events/sec average, 3M events/sec peak during software rollouts
- Event size: 0.8-3 KB protobuf or JSON payloads
- Daily volume: ~120-180 TB raw uncompressed telemetry
- Latency target: critical safety events queryable in <30 seconds; standard telemetry available in <5 minutes
- Retention: raw telemetry 180 days, curated aggregates 2 years
- Availability: 99.95% ingestion SLA across 3 regions
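As a sanity check on the figures above, the arithmetic can be sketched as follows. This is a back-of-envelope calculation only. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), and the helper names `daily_tb` and `implied_event_bytes` are illustrative.

```python
# Back-of-envelope sizing from the stated scale figures.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 TB = 10**12 bytes).
SECONDS_PER_DAY = 86_400
AVG_EVENTS_PER_SEC = 1_200_000
PEAK_EVENTS_PER_SEC = 3_000_000
FLEET_SIZE = 120_000
MIN_EVENT_BYTES = 800
MAX_EVENT_BYTES = 3_000
TB = 10**12

events_per_day = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY
events_per_agent_per_sec = AVG_EVENTS_PER_SEC / FLEET_SIZE


def daily_tb(avg_event_bytes):
    """Daily raw volume in TB for a given average event size."""
    return events_per_day * avg_event_bytes / TB


def implied_event_bytes(daily_volume_tb):
    """Average event size implied by a given daily volume."""
    return daily_volume_tb * TB / events_per_day


# Worst-case ingress bandwidth: peak rate with every event at maximum size.
peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * MAX_EVENT_BYTES

print(f"events/day:           {events_per_day:,}")
print(f"events/agent/sec:     {events_per_agent_per_sec:.0f}")
print(f"daily volume bounds:  {daily_tb(MIN_EVENT_BYTES):.1f} to "
      f"{daily_tb(MAX_EVENT_BYTES):.1f} TB")
print(f"implied avg event:    {implied_event_bytes(120):.0f} to "
      f"{implied_event_bytes(180):.0f} bytes")
print(f"peak ingress:         {peak_bytes_per_sec / 10**9:.1f} GB/s")
```

Under these assumptions the stated ~120-180 TB/day corresponds to an average event of roughly 1.2-1.7 KB, which sits inside the 0.8-3 KB range, and the fleet averages about 10 events per agent per second.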
Requirements
- Design ingestion for intermittent connectivity, duplicate delivery, and out-of-order events from edge devices.
- Separate high-priority safety telemetry from bulk operational telemetry while preserving per-agent ordering where required.
- Validate schemas, detect malformed payloads, and quarantine bad records without blocking the main stream.
- Build streaming transformations for deduplication, enrichment, and windowed fleet metrics.
- Store raw immutable data and curated analytics-ready tables for operations, ML, and compliance use cases.
- Support replay/backfill for historical reprocessing after schema or logic changes.
- Define orchestration, monitoring, alerting, and recovery procedures for broker, processor, and warehouse failures.
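The first few requirements (validation with quarantine, duplicate suppression, priority separation, and per-agent ordering) can be illustrated with a minimal in-memory sketch. This is not a proposed implementation. The field names, the set of safety event types, and the `Router` and `Deduplicator` classes are hypothetical, and plain Python lists stand in for broker topics.

```python
import zlib
from collections import OrderedDict

# Hypothetical field names and event types, chosen for illustration only.
REQUIRED_FIELDS = {
    "agent_id": str,
    "event_id": str,
    "event_type": str,
    "event_ts": (int, float),
}
SAFETY_EVENT_TYPES = {"collision", "emergency_stop"}


def validate(event):
    """Return None if the event is well formed, else a short reason string."""
    if not isinstance(event, dict):
        return "payload is not an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            return f"missing field: {name}"
        if not isinstance(event[name], expected):
            return f"wrong type for field: {name}"
    return None


def partition_for(agent_id, num_partitions):
    """Stable hash so all events from one agent map to one partition."""
    return zlib.crc32(agent_id.encode("utf-8")) % num_partitions


class Deduplicator:
    """Bounded LRU set of recently seen keys."""

    def __init__(self, max_entries=1_000_000):
        self._seen = OrderedDict()
        self._max = max_entries

    def is_duplicate(self, key):
        if key in self._seen:
            self._seen.move_to_end(key)
            return True
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest key
        return False


class Router:
    """Validate, deduplicate, and route events to a priority lane."""

    def __init__(self, num_partitions=8, dedup=None):
        self.num_partitions = num_partitions
        self.dedup = dedup or Deduplicator()
        self.lanes = {"safety": [], "bulk": []}
        self.quarantine = []

    def handle(self, event):
        reason = validate(event)
        if reason is not None:
            # Bad records are set aside; the main stream keeps flowing.
            self.quarantine.append((reason, event))
            return "quarantine"
        if self.dedup.is_duplicate((event["agent_id"], event["event_id"])):
            return "duplicate"
        lane = "safety" if event["event_type"] in SAFETY_EVENT_TYPES else "bulk"
        partition = partition_for(event["agent_id"], self.num_partitions)
        self.lanes[lane].append((partition, event))
        return lane
```

Keying the partition on `agent_id` is one way to keep per-agent ordering within a lane. The bounded LRU set only catches duplicates that arrive while the key is still cached, so duplicates arriving after long offline periods would likely need a larger or durable deduplication store.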
Constraints
- Primary cloud is AWS; existing analytics stack uses S3, Databricks, Airflow, and Snowflake.
- Incremental budget is capped at $80K/month, excluding already-committed Snowflake compute.
- Telemetry includes location and incident data, so the system must support PII minimization, encryption, and regional data residency.
- Edge agents may buffer up to 6 hours of data and reconnect in bursts, causing replay spikes.
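The size of a replay spike can be estimated from the numbers already given. The sketch below rests on several loud assumptions: every agent emits at the fleet-average rate, the gap between the stated peak and average ingress rates is treated as spare capacity for replay, and the count of 12,000 reconnecting agents is purely hypothetical.

```python
# Replay-spike estimate from the stated figures.
# Assumptions: uniform per-agent rate; headroom = peak rate - average rate.
FLEET_SIZE = 120_000
AVG_EVENTS_PER_SEC = 1_200_000
PEAK_EVENTS_PER_SEC = 3_000_000
MAX_BUFFER_SECONDS = 6 * 3600

events_per_agent_per_sec = AVG_EVENTS_PER_SEC // FLEET_SIZE
buffered_events_per_agent = events_per_agent_per_sec * MAX_BUFFER_SECONDS
headroom_events_per_sec = PEAK_EVENTS_PER_SEC - AVG_EVENTS_PER_SEC


def backlog_events(reconnecting_agents):
    """Total buffered events if each agent returns with a full 6-hour buffer."""
    return reconnecting_agents * buffered_events_per_agent


def drain_seconds(reconnecting_agents):
    """Time to absorb the backlog using only the assumed headroom."""
    return backlog_events(reconnecting_agents) / headroom_events_per_sec


agents = 12_000  # hypothetical: 10% of the fleet reconnects at once
print(f"buffered events per agent: {buffered_events_per_agent:,}")
print(f"backlog for {agents:,} agents: {backlog_events(agents):,} events")
print(f"drain time at headroom: {drain_seconds(agents) / 60:.0f} minutes")
```

Under these assumptions, a full 6-hour buffer holds 216,000 events per agent, and 12,000 agents reconnecting together would take about 24 minutes to drain. That exceeds the <5 minute standard-telemetry target, which suggests replayed data needs either rate limiting at the edge or a separate backfill path.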