Context
PulseGrid operates observability agents deployed across mobile apps, edge devices, and backend services. Its current ingestion path uses regional REST collectors writing compressed JSON files to Amazon S3, followed by hourly Spark ETL into Snowflake. This architecture cannot support real-time alerting or tenant isolation, and it buckles under burst traffic from incident storms.
You need to design a highly available telemetry ingestion pipeline that can ingest metrics, logs, and lightweight traces with strict durability and low-latency delivery to downstream analytics and alerting systems.
Scale Requirements
- Peak throughput: 3 million events/second globally, 800K sustained average
- Event size: 0.8-1.5 KB average JSON or Protobuf payloads
- Daily volume: ~55-105 TB raw uncompressed (≈69 billion events/day at the 800K/s sustained average; see the capacity sketch after this list)
- Latency target: P95 ingest-to-queryable under 60 seconds; P99 under 2 minutes
- Availability target: 99.99% ingestion availability across 3 AWS regions
- Retention: 30 days hot searchable data, 1 year cold object storage archive
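To ground the sizing above, here is a quick back-of-the-envelope calculation at the stated peak. It assumes Kinesis Data Streams provisioned-mode per-shard write limits (1 MB/s or 1,000 records/s); if the design lands on MSK/Kafka instead, substitute per-partition throughput. The 1.15 KB average is simply the midpoint of the stated payload range.

```python
# Back-of-the-envelope sizing for the stated peak load.
# Assumes Kinesis Data Streams provisioned-mode per-shard write limits
# (1 MB/s or 1,000 records/s); adjust if using MSK/Kafka partitions.

PEAK_EVENTS_PER_SEC = 3_000_000
SUSTAINED_EVENTS_PER_SEC = 800_000
AVG_EVENT_KB = 1.15  # midpoint of the 0.8-1.5 KB range

SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1_000

peak_mb_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_KB / 1024
shards_by_bandwidth = peak_mb_per_sec / SHARD_MB_PER_SEC
shards_by_records = PEAK_EVENTS_PER_SEC / SHARD_RECORDS_PER_SEC

daily_tb = SUSTAINED_EVENTS_PER_SEC * AVG_EVENT_KB * 86_400 / 1024**3

print(f"peak ingress: {peak_mb_per_sec:,.0f} MB/s")        # ~3,370 MB/s
print(f"shards needed: {max(shards_by_bandwidth, shards_by_records):,.0f}")
print(f"daily raw volume: {daily_tb:.0f} TB")               # ~74 TB
```

Note that at ~1.15 KB per event the stream is bandwidth-bound, not record-count-bound, so the shard count (~3,400 at peak) follows from the MB/s limit.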
Requirements
- Design a multi-region ingestion layer that supports backpressure, replay, and tenant-aware routing (see the producer sketch after this list).
- Validate schemas, enforce idempotency, and quarantine malformed or poison events without blocking healthy traffic (see the validation sketch below).
- Support real-time stream processing for enrichment, deduplication, partitioning, and rollups (see the rollup sketch below).
- Deliver data to both a low-latency analytics store and a durable raw data lake for reprocessing.
- Provide orchestration for backfills, schema rollouts, replay jobs, and downstream aggregate builds.
- Define monitoring, alerting, SLOs, and recovery procedures for broker, processor, and sink failures.
- Explain how the design handles regional failover and exactly-once or effectively-once guarantees where feasible.
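As one concrete shape for the ingestion-layer requirement, a minimal producer sketch against Kinesis via boto3: the tenant ID drives both stream selection (EU-resident tenants pinned to an EU-region stream for residency) and the partition key, and throttling is absorbed with capped exponential backoff on the failed subset rather than dropping events. The stream names, the tenant set, and the retry policy are illustrative assumptions, not a prescribed implementation.

```python
import json
import time
import boto3

# Illustrative stream layout: per-region streams, with EU-resident tenants
# pinned to the EU stream for data residency. Names are assumptions.
STREAMS = {"eu": "telemetry-ingest-eu-west-1", "global": "telemetry-ingest-us-east-1"}
EU_RESIDENT_TENANTS = {"tenant-042", "tenant-107"}  # would come from a tenant registry

kinesis = boto3.client("kinesis")

def route(tenant_id: str) -> str:
    """Tenant-aware routing: residency first, default region otherwise."""
    return STREAMS["eu"] if tenant_id in EU_RESIDENT_TENANTS else STREAMS["global"]

def put_batch(stream: str, events: list[dict], max_attempts: int = 6) -> None:
    """Send a batch, retrying only the failed subset with capped backoff.

    This is the backpressure story at the producer edge: on throttling we
    slow down and retry instead of dropping telemetry.
    """
    records = [
        {"Data": json.dumps(e).encode(), "PartitionKey": e["tenant_id"]}
        for e in events
    ]
    for attempt in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Kinesis reports per-record outcomes positionally; keep only failures.
        records = [r for r, s in zip(records, resp["Records"]) if "ErrorCode" in s]
        time.sleep(min(2 ** attempt * 0.1, 5.0))  # capped exponential backoff
    raise RuntimeError(f"{len(records)} records still failing after retries")
```

In practice the edge agents would spill to local disk once retries are exhausted, which is what makes replay from the edge possible; and a very large tenant may need a salted partition key (tenant ID plus a shard suffix) to avoid hot partitions.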
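For the validation and quarantine requirement, a sketch of the per-record decision a stream processor would make: schema-check, dedupe on a client-supplied event ID, and divert poison records to a quarantine stream so healthy traffic keeps flowing. The toy field-level schema, the Redis-backed dedupe window, and the quarantine stream name are all assumptions standing in for a real schema registry and state store.

```python
import json
import redis
import boto3

kinesis = boto3.client("kinesis")
dedupe = redis.Redis(host="dedupe-cache", port=6379)  # assumed ElastiCache endpoint

QUARANTINE_STREAM = "telemetry-quarantine"  # assumed name
REQUIRED_FIELDS = {"event_id", "tenant_id", "ts", "type", "payload"}  # toy schema
DEDUPE_TTL_SECONDS = 15 * 60  # idempotency window; tune to max replay lag

def process_record(raw: bytes) -> dict | None:
    """Return the event if valid and unseen; quarantine or drop otherwise."""
    try:
        event = json.loads(raw)
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    except ValueError as err:  # JSONDecodeError and bad-UTF-8 are ValueErrors
        # Poison/malformed: divert without blocking the healthy stream.
        kinesis.put_record(
            StreamName=QUARANTINE_STREAM,
            Data=json.dumps(
                {"error": str(err), "raw": raw.decode("utf-8", "replace")}
            ).encode(),
            PartitionKey="malformed",
        )
        return None

    # Effectively-once: SET NX claims this event_id within the TTL, so
    # redeliveries and producer retries collapse to a single emission.
    key = f"seen:{event['tenant_id']}:{event['event_id']}"
    if not dedupe.set(key, 1, nx=True, ex=DEDUPE_TTL_SECONDS):
        return None  # duplicate
    return event
```

The same dedupe key is what backs the effectively-once guarantee during replay and regional failover: replayed events hit the claim and collapse, as long as the TTL covers the maximum replay lag.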
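For the rollup piece of the stream-processing requirement, the managed-service fit on AWS is Amazon Managed Service for Apache Flink; the sketch below shows the underlying keyed tumbling-window logic in plain Python purely to keep it self-contained, not as a substitute for a real stream processor, which would add watermarks, late-data handling, and checkpointed state.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

class TumblingCounter:
    """Keyed tumbling-window rollup: counts per (tenant, event type) per minute.

    Event time drives window assignment; flush() emits only windows wholly
    behind the watermark, mirroring what Flink does with checkpointed state.
    """

    def __init__(self) -> None:
        self.windows: dict[int, dict[tuple[str, str], int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def add(self, event: dict) -> None:
        window_start = int(event["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        self.windows[window_start][(event["tenant_id"], event["type"])] += 1

    def flush(self, watermark_ts: int) -> list[dict]:
        """Emit rollups for windows that closed before the watermark."""
        out = []
        closed = sorted(w for w in self.windows if w + WINDOW_SECONDS <= watermark_ts)
        for start in closed:
            for (tenant, etype), n in self.windows.pop(start).items():
                out.append({"window": start, "tenant_id": tenant, "type": etype, "count": n})
        return out
```

These minute-level rollups are what feed the real-time dashboards within the 60-second P95 target, while the raw stream continues to the data lake untouched for reprocessing and ML feature generation.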
Constraints
- AWS is the mandated cloud; managed services are preferred over self-hosted clusters.
- Incremental infrastructure budget is capped at $120K/month (see the cost sanity check after this list).
- Some tenants require data residency within the EU region.
- The platform team is small (5 data engineers, 2 SREs), so operational complexity must be justified.
- Downstream consumers include both real-time dashboards and batch ML feature generation.
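Against the $120K/month cap, a rough broker-only sanity check. The constants are illustrative us-east-1 Kinesis provisioned-mode list prices at the time of writing (an assumption; verify current pricing), and the shard count comes from the capacity sketch above.

```python
# Rough monthly Kinesis cost at the stated load, using illustrative
# us-east-1 provisioned-mode list prices (assumptions; check current pricing).
SHARD_HOUR_USD = 0.015
PUT_UNIT_USD_PER_MILLION = 0.014  # one PUT payload unit = 25 KB

shards = 3_400            # from the capacity sketch above (provisioned for peak)
hours_per_month = 730

events_per_month = 800_000 * 86_400 * 30  # sustained average
put_units = events_per_month              # ~1.15 KB events fit in one 25 KB unit

shard_cost = shards * SHARD_HOUR_USD * hours_per_month
put_cost = put_units / 1_000_000 * PUT_UNIT_USD_PER_MILLION

print(f"shard-hours: ${shard_cost:,.0f}/month")  # ≈ $37K
print(f"PUT units:   ${put_cost:,.0f}/month")    # ≈ $29K
```

Roughly $66K/month for the broker alone leaves about half the budget for processing, the analytics store, and the lake, which is why on-demand capacity mode or aggressive peak-time autoscaling (rather than permanently provisioning for the 3M/s peak) is worth evaluating in the design.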