Context
PulseGrid operates observability agents deployed across mobile apps, edge devices, and backend services. Its current ingestion path uses regional REST collectors writing compressed JSON files to Amazon S3, followed by hourly Spark ETL into Snowflake. This architecture cannot support real-time alerting or tenant isolation, and it buckles under burst traffic from incident storms.
You need to design a highly available telemetry ingestion pipeline that can ingest metrics, logs, and lightweight traces with strict durability and low-latency delivery to downstream analytics and alerting systems.
Scale Requirements
- Peak throughput: 3 million events/second globally, 800K sustained average
- Event size: 0.8-1.5 KB average JSON or Protobuf payloads
- Daily volume: ~55-105 TB raw uncompressed (≈69 billion events/day at the 800K/s sustained average; see the capacity sketch after this list)
- Latency target: P95 ingest-to-queryable under 60 seconds; P99 under 2 minutes
- Availability target: 99.99% ingestion availability across 3 AWS regions
- Retention: 30 days hot searchable data, 1 year cold object storage archive
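To ground the sizing above, here is a quick back-of-the-envelope calculation at the stated peak. It assumes Kinesis Data Streams provisioned-mode per-shard write limits (1 MB/s or 1,000 records/s); if the design lands on MSK/Kafka instead, substitute per-partition throughput. The 1.15 KB average is simply the midpoint of the stated payload range.

```python
# Back-of-the-envelope sizing for the stated peak load.
# Assumes Kinesis Data Streams provisioned-mode per-shard write limits
# (1 MB/s or 1,000 records/s); adjust if using MSK/Kafka partitions.

PEAK_EVENTS_PER_SEC = 3_000_000
SUSTAINED_EVENTS_PER_SEC = 800_000
AVG_EVENT_KB = 1.15  # midpoint of the 0.8-1.5 KB range

SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1_000

peak_mb_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_KB / 1024
shards_by_bandwidth = peak_mb_per_sec / SHARD_MB_PER_SEC
shards_by_records = PEAK_EVENTS_PER_SEC / SHARD_RECORDS_PER_SEC

daily_tb = SUSTAINED_EVENTS_PER_SEC * AVG_EVENT_KB * 86_400 / 1024**3

print(f"peak ingress: {peak_mb_per_sec:,.0f} MB/s")        # ~3,370 MB/s
print(f"shards needed: {max(shards_by_bandwidth, shards_by_records):,.0f}")
print(f"daily raw volume: {daily_tb:.0f} TB")               # ~74 TB
```

Note that at ~1.15 KB per event the stream is bandwidth-bound, not record-count-bound, so the shard count (~3,400 at peak) follows from the MB/s limit.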
Requirements
- Design a multi-region ingestion layer that supports backpressure, replay, and tenant-aware routing (see the producer sketch after this list).
- Validate schemas, enforce idempotency, and quarantine malformed or poison events without blocking healthy traffic (see the validation sketch below).
- Support real-time stream processing for enrichment, deduplication, partitioning, and rollups (see the rollup sketch below).
- Deliver data to both a low-latency analytics store and a durable raw data lake for reprocessing.
- Provide orchestration for backfills, schema rollouts, replay jobs, and downstream aggregate builds.
- Define monitoring, alerting, SLOs, and recovery procedures for broker, processor, and sink failures.
- Explain how the design handles regional failover and exactly-once or effectively-once guarantees where feasible.
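As one concrete shape for the ingestion-layer requirement, a minimal producer sketch against Kinesis via boto3: the tenant ID drives both stream selection (EU-resident tenants pinned to an EU-region stream for residency) and the partition key, and throttling is absorbed with capped exponential backoff on the failed subset rather than dropping events. The stream names, the tenant set, and the retry policy are illustrative assumptions, not a prescribed implementation.

```python
import json
import time
import boto3

# Illustrative stream layout: per-region streams, with EU-resident tenants
# pinned to the EU stream for data residency. Names are assumptions.
STREAMS = {"eu": "telemetry-ingest-eu-west-1", "global": "telemetry-ingest-us-east-1"}
EU_RESIDENT_TENANTS = {"tenant-042", "tenant-107"}  # would come from a tenant registry

kinesis = boto3.client("kinesis")

def route(tenant_id: str) -> str:
    """Tenant-aware routing: residency first, default region otherwise."""
    return STREAMS["eu"] if tenant_id in EU_RESIDENT_TENANTS else STREAMS["global"]

def put_batch(stream: str, events: list[dict], max_attempts: int = 6) -> None:
    """Send a batch, retrying only the failed subset with capped backoff.

    This is the backpressure story at the producer edge: on throttling we
    slow down and retry instead of dropping telemetry.
    """
    records = [
        {"Data": json.dumps(e).encode(), "PartitionKey": e["tenant_id"]}
        for e in events
    ]
    for attempt in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Kinesis reports per-record outcomes positionally; keep only failures.
        records = [r for r, s in zip(records, resp["Records"]) if "ErrorCode" in s]
        time.sleep(min(2 ** attempt * 0.1, 5.0))  # capped exponential backoff
    raise RuntimeError(f"{len(records)} records still failing after retries")
```

In practice the edge agents would spill to local disk once retries are exhausted, which is what makes replay from the edge possible; and a very large tenant may need a salted partition key (tenant ID plus a shard suffix) to avoid hot partitions.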
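For the validation and quarantine requirement, a sketch of the per-record decision a stream processor would make: schema-check, dedupe on a client-supplied event ID, and divert poison records to a quarantine stream so healthy traffic keeps flowing. The toy field-level schema, the Redis-backed dedupe window, and the quarantine stream name are all assumptions standing in for a real schema registry and state store.

```python
import json
import redis
import boto3

kinesis = boto3.client("kinesis")
dedupe = redis.Redis(host="dedupe-cache", port=6379)  # assumed ElastiCache endpoint

QUARANTINE_STREAM = "telemetry-quarantine"  # assumed name
REQUIRED_FIELDS = {"event_id", "tenant_id", "ts", "type", "payload"}  # toy schema
DEDUPE_TTL_SECONDS = 15 * 60  # idempotency window; tune to max replay lag

def process_record(raw: bytes) -> dict | None:
    """Return the event if valid and unseen; quarantine or drop otherwise."""
    try:
        event = json.loads(raw)
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    except ValueError as err:  # JSONDecodeError and bad-UTF-8 are ValueErrors
        # Poison/malformed: divert without blocking the healthy stream.
        kinesis.put_record(
            StreamName=QUARANTINE_STREAM,
            Data=json.dumps(
                {"error": str(err), "raw": raw.decode("utf-8", "replace")}
            ).encode(),
            PartitionKey="malformed",
        )
        return None

    # Effectively-once: SET NX claims this event_id within the TTL, so
    # redeliveries and producer retries collapse to a single emission.
    key = f"seen:{event['tenant_id']}:{event['event_id']}"
    if not dedupe.set(key, 1, nx=True, ex=DEDUPE_TTL_SECONDS):
        return None  # duplicate
    return event
```

The same dedupe key is what backs the effectively-once guarantee during replay and regional failover: replayed events hit the claim and collapse, as long as the TTL covers the maximum replay lag.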
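For the rollup piece of the stream-processing requirement, the managed-service fit on AWS is Amazon Managed Service for Apache Flink; the sketch below shows the underlying keyed tumbling-window logic in plain Python purely to keep it self-contained, not as a substitute for a real stream processor, which would add watermarks, late-data handling, and checkpointed state.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

class TumblingCounter:
    """Keyed tumbling-window rollup: counts per (tenant, event type) per minute.

    Event time drives window assignment; flush() emits only windows wholly
    behind the watermark, mirroring what Flink does with checkpointed state.
    """

    def __init__(self) -> None:
        self.windows: dict[int, dict[tuple[str, str], int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def add(self, event: dict) -> None:
        window_start = int(event["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        self.windows[window_start][(event["tenant_id"], event["type"])] += 1

    def flush(self, watermark_ts: int) -> list[dict]:
        """Emit rollups for windows that closed before the watermark."""
        out = []
        closed = sorted(w for w in self.windows if w + WINDOW_SECONDS <= watermark_ts)
        for start in closed:
            for (tenant, etype), n in self.windows.pop(start).items():
                out.append({"window": start, "tenant_id": tenant, "type": etype, "count": n})
        return out
```

These minute-level rollups are what feed the real-time dashboards within the 60-second P95 target, while the raw stream continues to the data lake untouched for reprocessing and ML feature generation.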
Constraints
- AWS is the mandated cloud; managed services are preferred over self-hosted clusters.
- Incremental infrastructure budget is capped at $120K/month (see the cost sanity check after this list).
- Some tenants require data residency within the EU region.
- The platform team is small (5 data engineers, 2 SREs), so operational complexity must be justified.
- Downstream consumers include both real-time dashboards and batch ML feature generation.
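Against the $120K/month cap, a rough broker-only sanity check. The constants are illustrative us-east-1 Kinesis provisioned-mode list prices at the time of writing (an assumption; verify current pricing), and the shard count comes from the capacity sketch above.

```python
# Rough monthly Kinesis cost at the stated load, using illustrative
# us-east-1 provisioned-mode list prices (assumptions; check current pricing).
SHARD_HOUR_USD = 0.015
PUT_UNIT_USD_PER_MILLION = 0.014  # one PUT payload unit = 25 KB

shards = 3_400            # from the capacity sketch above (provisioned for peak)
hours_per_month = 730

events_per_month = 800_000 * 86_400 * 30  # sustained average
put_units = events_per_month              # ~1.15 KB events fit in one 25 KB unit

shard_cost = shards * SHARD_HOUR_USD * hours_per_month
put_cost = put_units / 1_000_000 * PUT_UNIT_USD_PER_MILLION

print(f"shard-hours: ${shard_cost:,.0f}/month")  # ≈ $37K
print(f"PUT units:   ${put_cost:,.0f}/month")    # ≈ $29K
```

Roughly $66K/month for the broker alone leaves about half the budget for processing, the analytics store, and the lake, which is why on-demand capacity mode or aggressive peak-time autoscaling (rather than permanently provisioning for the 3M/s peak) is worth evaluating in the design.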