Context
Meta operates large-scale production services whose telemetry is emitted from service binaries, edge layers, and infrastructure hosts. The current setup relies on fragmented log and metric collection with service-local alerts, causing delayed incident detection, inconsistent schemas, and high alert noise during regional failures.
Design a centralized observability and alerting pipeline using Meta-preferred surfaces such as Scribe for log transport, Scuba for interactive analytics, and internal service orchestration hooks. The goal is to ingest, validate, aggregate, and route telemetry into real-time alerting and longer-term analytical storage.
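Ingest, validate, aggregate, and route all operate on a common event envelope. A minimal sketch of what such an envelope could look like is below; the field names, the `TelemetryEvent` type, and the example service name are illustrative assumptions, not an actual Meta schema.

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical unified envelope shared by logs, metrics, and trace-derived
# events. Field names are assumptions for this sketch, not a real Meta schema.
@dataclass
class TelemetryEvent:
    service: str                  # emitting service (also the tenant tag)
    kind: str                     # "log" | "metric" | "trace"
    payload: dict                 # decoded JSON or protobuf body
    region: str = "unknown"
    emitted_at: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # dedup key

# Example producer-side construction (service name is made up):
event = TelemetryEvent(service="ads-ranker", kind="metric",
                       payload={"name": "qps", "value": 1200.0})
```

Carrying a globally unique `event_id` from emission onward is what later makes deduplication and idempotent replay cheap downstream.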
Scale Requirements
- Telemetry volume: 15M events/second peak across logs, metrics, and traces
- Payload size: 0.8-2.5 KB/event, mixed JSON and protobuf
- Daily ingest: ~1.2 PB raw telemetry
- Latency target: P95 under 30 seconds from emission to alert evaluation
- Retention: 30 days hot, 180 days cold archive
- Availability target: 99.99% for ingestion and alert evaluation
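A quick back-of-envelope check shows the stated numbers are mutually consistent, assuming the average payload sits near the midpoint of the 0.8–2.5 KB range:

```python
# Sanity-check: does 1.2 PB/day agree with a 15M events/s peak?
PEAK_EPS = 15_000_000            # events/second at peak (stated)
AVG_EVENT_KB = (0.8 + 2.5) / 2   # assumed midpoint of the payload range
DAILY_PB = 1.2                   # raw ingest per day (stated)

avg_bytes_per_day = DAILY_PB * 1000**5          # PB -> bytes (decimal units)
avg_eps = avg_bytes_per_day / (AVG_EVENT_KB * 1000) / 86_400

print(f"implied average rate: {avg_eps/1e6:.1f}M events/s "
      f"vs {PEAK_EPS/1e6:.0f}M peak")
```

The implied sustained average (~8–9M events/s) sits comfortably below the 15M peak, which is the expected shape for a diurnal workload.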
Requirements
- Build a unified ingestion pipeline for logs, metrics, and trace-derived events from thousands of services.
- Enforce schema validation, deduplication, tenant/service tagging, and PII redaction before downstream storage.
- Support both streaming alert evaluation and batch backfills for rule changes or incident replay.
- Route curated telemetry to low-latency serving for dashboards and alerts, plus cost-efficient long-term storage.
- Provide idempotent reprocessing, replay from checkpoints, and dead-letter handling for malformed or oversized events.
- Design monitoring for pipeline health, data quality, alert freshness, and false-positive spikes.
- Support multi-region failover without losing more than 60 seconds of data.
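The validation, tagging, redaction, and dedup requirements above can be sketched as a single pre-storage stage. Everything here is illustrative: the required-field set, the email-only PII pattern, and the in-memory seen-set all stand in for real schema registries, redaction policies, and a TTL'd shared dedup store.

```python
import re

# Illustrative pre-storage stage: schema check, PII masking, tenant tagging,
# and dedup. Field names and the PII pattern are assumptions for the sketch.
REQUIRED_FIELDS = {"service", "kind", "payload", "event_id"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

_seen: set[str] = set()  # production would use a TTL'd shared store

def process(event: dict) -> dict | None:
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"schema violation, missing: {sorted(missing)}")
    if event["event_id"] in _seen:       # drop exact duplicates
        return None
    _seen.add(event["event_id"])
    masked = {k: EMAIL_RE.sub("<redacted>", v) if isinstance(v, str) else v
              for k, v in event["payload"].items()}
    return {**event, "payload": masked, "tenant": event["service"]}
```

Masking before storage (rather than at query time) keeps analyst-facing surfaces clean even if access controls downstream are misconfigured.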
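Dead-letter handling and checkpointed replay can likewise be sketched with in-memory stand-ins; the 64 KB oversized cutoff and the list-backed queues are assumptions, not real limits or transports.

```python
import json

MAX_BYTES = 64 * 1024  # hypothetical oversized-event cutoff

def route(event: dict, main: list, dead_letter: list) -> None:
    """Divert malformed or oversized events to a dead-letter queue
    instead of dropping them or blocking the pipeline."""
    try:
        raw = json.dumps(event)
    except (TypeError, ValueError) as err:
        dead_letter.append({"error": str(err), "event": repr(event)})
        return
    if len(raw) > MAX_BYTES:
        dead_letter.append({"error": "oversized", "event": raw[:256]})
        return
    main.append(event)

def replay(log: list, checkpoint: dict, handler) -> None:
    """Resume from the last committed offset. Re-running is safe only
    because downstream processing dedups on event_id (idempotence)."""
    for offset in range(checkpoint.get("offset", 0), len(log)):
        handler(log[offset])
        checkpoint["offset"] = offset + 1  # commit only after success
```

Committing the checkpoint after each successful handler call gives at-least-once delivery; the event-level dedup upstream is what upgrades that to effectively-once.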
Constraints
- Must integrate with existing Meta telemetry producers and on-call workflows.
- Cross-region replication cost must be controlled; not all raw data can stay in hot storage.
- PII and security-sensitive fields must be masked before analyst-facing storage.
- Rule updates should deploy without stopping ingestion or causing duplicate alerts.
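One way to satisfy the last constraint is to let old and new rule versions evaluate concurrently during a rolling deploy, but dedup fired alerts on a key that excludes the rule version. A minimal sketch, with a hypothetical `maybe_fire` helper and an in-memory fired-set standing in for a shared alert store:

```python
# Dedup alerts on (rule_id, entity, window) so that overlapping evaluation
# of an old and a new rule version during a rolling deploy pages only once.
fired: set[tuple] = set()  # production would use a shared, TTL'd store

def maybe_fire(rule_id: str, entity: str, window_start: int,
               version: int) -> bool:
    key = (rule_id, entity, window_start)   # version deliberately excluded
    if key in fired:
        return False                        # already paged for this window
    fired.add(key)
    return True
```

Because the dedup key is version-agnostic, rule rollouts never stop ingestion and never double-page for the same evaluation window.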