Context
NorthGrid operates industrial HVAC and power-monitoring equipment across commercial buildings. The current pipeline uploads CSV files from edge gateways every 15 minutes into S3, then runs hourly Spark batch jobs; the resulting end-to-end latency, often exceeding an hour, is too slow for anomaly detection, operational dashboards, and alerting.
You need to design a scalable real-time data pipeline for sensor telemetry from thousands of endpoints while preserving raw data for replay and supporting downstream analytics.
Scale Requirements
- Endpoints: 75,000 active sensors across 6,000 sites
- Throughput: 180K events/sec peak, 45K events/sec average
- Event size: 0.8-1.5 KB JSON or Protobuf payloads
- Latency target: P95 < 10 seconds from device emission to queryable curated store
- Retention: Raw immutable data for 180 days; curated aggregates for 3 years
- Availability: 99.9% ingestion uptime
- Data quality: < 0.5% duplicate rate, < 0.1% malformed events after validation
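The scale figures above imply concrete bandwidth and storage budgets. A back-of-envelope sketch using the spec's peak/average rates, 1.5 KB worst-case payload, and 180-day raw retention (uncompressed; real storage would be lower with Parquet/compression):

```python
# Illustrative sizing only; all inputs come from the spec above.
PEAK_EPS = 180_000   # peak events/sec
AVG_EPS = 45_000     # average events/sec
MAX_EVENT_KB = 1.5   # upper bound on payload size

# Peak ingest bandwidth in MB/s.
peak_mb_per_sec = PEAK_EPS * MAX_EVENT_KB / 1024

# Raw retention footprint over 180 days at the average rate, in TB.
raw_tb_180_days = AVG_EPS * MAX_EVENT_KB * 86_400 * 180 / 1024**3

print(f"Peak ingest bandwidth: {peak_mb_per_sec:.0f} MB/s")
print(f"Raw retention (uncompressed, 180 days): {raw_tb_180_days:.0f} TB")
```

This works out to roughly 264 MB/s at peak and on the order of 1 PB of uncompressed raw data over 180 days, which makes compression and columnar formats in the raw store a sizing decision, not an afterthought.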
Requirements
- Ingest telemetry from MQTT/HTTPS edge gateways with backpressure handling and horizontal scalability.
- Validate schema, deduplicate by event_id, and quarantine malformed or late events.
- Enrich records with device metadata, site information, and standardized timestamps.
- Support both real-time operational queries and downstream warehouse analytics.
- Guarantee idempotent writes and replay capability for backfills or processor failures.
- Provide monitoring for lag, latency, data quality, and cost.
- Explain partitioning, checkpointing, and schema evolution strategy.
Constraints
- AWS is the required cloud; existing footprint includes S3, IAM, and CloudWatch.
- Team size is 5 engineers, so operational complexity should be controlled.
- Incremental infrastructure budget is $35K/month.
- Some sites have intermittent connectivity; the design must tolerate bursts after reconnect.
- Compliance requires auditability and deletion of site-level data within 7 days of request.
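The 7-day deletion constraint interacts directly with the partitioning strategy: if raw objects are keyed by site first, a site-level deletion request maps to a single prefix delete. The key layout and helper names below are assumptions for illustration (the in-memory key list stands in for an S3 prefix listing):

```python
# Hypothetical raw-store key layout: site first, then date partition.
RAW_LAYOUT = "raw/site_id={site_id}/dt={dt}/{object_name}"

def site_prefix(site_id: str) -> str:
    """Prefix covering every raw object belonging to one site."""
    return f"raw/site_id={site_id}/"

def keys_to_delete(all_keys: list[str], site_id: str) -> list[str]:
    """Select the keys covered by a site-level deletion request."""
    prefix = site_prefix(site_id)
    return [k for k in all_keys if k.startswith(prefix)]

keys = [
    RAW_LAYOUT.format(site_id="s-0042", dt="2024-05-01", object_name="part-0.parquet"),
    RAW_LAYOUT.format(site_id="s-0042", dt="2024-05-02", object_name="part-0.parquet"),
    RAW_LAYOUT.format(site_id="s-0099", dt="2024-05-01", object_name="part-0.parquet"),
]
print(keys_to_delete(keys, "s-0042"))  # both s-0042 objects, neither s-0099 object
```

If the layout were date-first instead, every deletion request would require a full scan across all date partitions, which makes the 7-day SLA and auditability harder to demonstrate.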