Context
NorthGrid operates industrial HVAC and power-monitoring equipment across commercial buildings. The current pipeline uploads CSV files from edge gateways every 15 minutes into S3, then runs hourly Spark batch jobs; the resulting end-to-end latency, often exceeding an hour, is too slow for anomaly detection, operational dashboards, and alerting.
You need to design a scalable real-time data pipeline for sensor telemetry from thousands of endpoints while preserving raw data for replay and supporting downstream analytics.
Scale Requirements
- Endpoints: 75,000 active sensors across 6,000 sites
- Throughput: 180K events/sec peak, 45K events/sec average
- Event size: 0.8-1.5 KB JSON or Protobuf payloads
- Latency target: P95 < 10 seconds from device emission to queryable curated store
- Retention: Raw immutable data for 180 days; curated aggregates for 3 years
- Availability: 99.9% ingestion uptime
- Data quality: < 0.5% duplicate rate, < 0.1% malformed events after validation
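The scale figures above imply concrete bandwidth and storage budgets. A back-of-envelope sketch using the spec's peak/average rates, 1.5 KB worst-case payload, and 180-day raw retention (uncompressed; real storage would be lower with Parquet/compression):

```python
# Illustrative sizing only; all inputs come from the spec above.
PEAK_EPS = 180_000   # peak events/sec
AVG_EPS = 45_000     # average events/sec
MAX_EVENT_KB = 1.5   # upper bound on payload size

# Peak ingest bandwidth in MB/s.
peak_mb_per_sec = PEAK_EPS * MAX_EVENT_KB / 1024

# Raw retention footprint over 180 days at the average rate, in TB.
raw_tb_180_days = AVG_EPS * MAX_EVENT_KB * 86_400 * 180 / 1024**3

print(f"Peak ingest bandwidth: {peak_mb_per_sec:.0f} MB/s")
print(f"Raw retention (uncompressed, 180 days): {raw_tb_180_days:.0f} TB")
```

This works out to roughly 264 MB/s at peak and on the order of 1 PB of uncompressed raw data over 180 days, which makes compression and columnar formats in the raw store a sizing decision, not an afterthought.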
Requirements
- Ingest telemetry from MQTT/HTTPS edge gateways with backpressure handling and horizontal scalability.
- Validate schema, deduplicate by event_id, and quarantine malformed or late events.
- Enrich records with device metadata, site information, and standardized timestamps.
- Support both real-time operational queries and downstream warehouse analytics.
- Guarantee idempotent writes and replay capability for backfills or processor failures.
- Provide monitoring for lag, latency, data quality, and cost.
- Explain partitioning, checkpointing, and schema evolution strategy.
Constraints
- AWS is the required cloud; existing footprint includes S3, IAM, and CloudWatch.
- Team size is 5 engineers, so operational complexity should be controlled.
- Incremental infrastructure budget is $35K/month.
- Some sites have intermittent connectivity; the design must tolerate bursts after reconnect.
- Compliance requires auditability and deletion of site-level data within 7 days of request.
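The 7-day deletion constraint interacts directly with the partitioning strategy: if raw objects are keyed by site first, a site-level deletion request maps to a single prefix delete. The key layout and helper names below are assumptions for illustration (the in-memory key list stands in for an S3 prefix listing):

```python
# Hypothetical raw-store key layout: site first, then date partition.
RAW_LAYOUT = "raw/site_id={site_id}/dt={dt}/{object_name}"

def site_prefix(site_id: str) -> str:
    """Prefix covering every raw object belonging to one site."""
    return f"raw/site_id={site_id}/"

def keys_to_delete(all_keys: list[str], site_id: str) -> list[str]:
    """Select the keys covered by a site-level deletion request."""
    prefix = site_prefix(site_id)
    return [k for k in all_keys if k.startswith(prefix)]

keys = [
    RAW_LAYOUT.format(site_id="s-0042", dt="2024-05-01", object_name="part-0.parquet"),
    RAW_LAYOUT.format(site_id="s-0042", dt="2024-05-02", object_name="part-0.parquet"),
    RAW_LAYOUT.format(site_id="s-0099", dt="2024-05-01", object_name="part-0.parquet"),
]
print(keys_to_delete(keys, "s-0042"))  # both s-0042 objects, neither s-0099 object
```

If the layout were date-first instead, every deletion request would require a full scan across all date partitions, which makes the 7-day SLA and auditability harder to demonstrate.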