Context
Meta operates large-scale production services whose telemetry is emitted from service binaries, edge layers, and infrastructure hosts. The current setup relies on fragmented log and metric collection with service-local alerts, causing delayed incident detection, inconsistent schemas, and high alert noise during regional failures.
Design a centralized observability and alerting pipeline using Meta-preferred surfaces such as Scribe for log transport, Scuba for interactive analytics, and internal service orchestration hooks. The goal is to ingest, validate, aggregate, and route telemetry into real-time alerting and longer-term analytical storage.
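Ingest, validate, aggregate, and route all operate on a common event envelope. A minimal sketch of what such an envelope could look like is below; the field names, the `TelemetryEvent` type, and the example service name are illustrative assumptions, not an actual Meta schema.

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical unified envelope shared by logs, metrics, and trace-derived
# events. Field names are assumptions for this sketch, not a real Meta schema.
@dataclass
class TelemetryEvent:
    service: str                  # emitting service (also the tenant tag)
    kind: str                     # "log" | "metric" | "trace"
    payload: dict                 # decoded JSON or protobuf body
    region: str = "unknown"
    emitted_at: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # dedup key

# Example producer-side construction (service name is made up):
event = TelemetryEvent(service="ads-ranker", kind="metric",
                       payload={"name": "qps", "value": 1200.0})
```

Carrying a globally unique `event_id` from emission onward is what later makes deduplication and idempotent replay cheap downstream.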
Scale Requirements
- Telemetry volume: 15M events/second peak across logs, metrics, and traces
- Payload size: 0.8-2.5 KB/event, mixed JSON and protobuf
- Daily ingest: ~1.2 PB raw telemetry
- Latency target: P95 under 30 seconds from emission to alert evaluation
- Retention: 30 days hot, 180 days cold archive
- Availability target: 99.99% for ingestion and alert evaluation
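A quick back-of-envelope check shows the stated numbers are mutually consistent, assuming the average payload sits near the midpoint of the 0.8–2.5 KB range:

```python
# Sanity-check: does 1.2 PB/day agree with a 15M events/s peak?
PEAK_EPS = 15_000_000            # events/second at peak (stated)
AVG_EVENT_KB = (0.8 + 2.5) / 2   # assumed midpoint of the payload range
DAILY_PB = 1.2                   # raw ingest per day (stated)

avg_bytes_per_day = DAILY_PB * 1000**5          # PB -> bytes (decimal units)
avg_eps = avg_bytes_per_day / (AVG_EVENT_KB * 1000) / 86_400

print(f"implied average rate: {avg_eps/1e6:.1f}M events/s "
      f"vs {PEAK_EPS/1e6:.0f}M peak")
```

The implied sustained average (~8–9M events/s) sits comfortably below the 15M peak, which is the expected shape for a diurnal workload.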
Requirements
- Build a unified ingestion pipeline for logs, metrics, and trace-derived events from thousands of services.
- Enforce schema validation, deduplication, tenant/service tagging, and PII redaction before downstream storage.
- Support both streaming alert evaluation and batch backfills for rule changes or incident replay.
- Route curated telemetry to low-latency serving for dashboards and alerts, plus cost-efficient long-term storage.
- Provide idempotent reprocessing, replay from checkpoints, and dead-letter handling for malformed or oversized events.
- Design monitoring for pipeline health, data quality, alert freshness, and false-positive spikes.
- Support multi-region failover without losing more than 60 seconds of data.
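The validation, tagging, redaction, and dedup requirements above can be sketched as a single pre-storage stage. Everything here is illustrative: the required-field set, the email-only PII pattern, and the in-memory seen-set all stand in for real schema registries, redaction policies, and a TTL'd shared dedup store.

```python
import re

# Illustrative pre-storage stage: schema check, PII masking, tenant tagging,
# and dedup. Field names and the PII pattern are assumptions for the sketch.
REQUIRED_FIELDS = {"service", "kind", "payload", "event_id"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

_seen: set[str] = set()  # production would use a TTL'd shared store

def process(event: dict) -> dict | None:
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"schema violation, missing: {sorted(missing)}")
    if event["event_id"] in _seen:       # drop exact duplicates
        return None
    _seen.add(event["event_id"])
    masked = {k: EMAIL_RE.sub("<redacted>", v) if isinstance(v, str) else v
              for k, v in event["payload"].items()}
    return {**event, "payload": masked, "tenant": event["service"]}
```

Masking before storage (rather than at query time) keeps analyst-facing surfaces clean even if access controls downstream are misconfigured.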
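Dead-letter handling and checkpointed replay can likewise be sketched with in-memory stand-ins; the 64 KB oversized cutoff and the list-backed queues are assumptions, not real limits or transports.

```python
import json

MAX_BYTES = 64 * 1024  # hypothetical oversized-event cutoff

def route(event: dict, main: list, dead_letter: list) -> None:
    """Divert malformed or oversized events to a dead-letter queue
    instead of dropping them or blocking the pipeline."""
    try:
        raw = json.dumps(event)
    except (TypeError, ValueError) as err:
        dead_letter.append({"error": str(err), "event": repr(event)})
        return
    if len(raw) > MAX_BYTES:
        dead_letter.append({"error": "oversized", "event": raw[:256]})
        return
    main.append(event)

def replay(log: list, checkpoint: dict, handler) -> None:
    """Resume from the last committed offset. Re-running is safe only
    because downstream processing dedups on event_id (idempotence)."""
    for offset in range(checkpoint.get("offset", 0), len(log)):
        handler(log[offset])
        checkpoint["offset"] = offset + 1  # commit only after success
```

Committing the checkpoint after each successful handler call gives at-least-once delivery; the event-level dedup upstream is what upgrades that to effectively-once.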
Constraints
- Must integrate with existing Meta telemetry producers and on-call workflows.
- Cross-region replication cost must be controlled; not all raw data can stay in hot storage.
- PII and security-sensitive fields must be masked before analyst-facing storage.
- Rule updates should deploy without stopping ingestion or causing duplicate alerts.
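One way to satisfy the last constraint is to let old and new rule versions evaluate concurrently during a rolling deploy, but dedup fired alerts on a key that excludes the rule version. A minimal sketch, with a hypothetical `maybe_fire` helper and an in-memory fired-set standing in for a shared alert store:

```python
# Dedup alerts on (rule_id, entity, window) so that overlapping evaluation
# of an old and a new rule version during a rolling deploy pages only once.
fired: set[tuple] = set()  # production would use a shared, TTL'd store

def maybe_fire(rule_id: str, entity: str, window_start: int,
               version: int) -> bool:
    key = (rule_id, entity, window_start)   # version deliberately excluded
    if key in fired:
        return False                        # already paged for this window
    fired.add(key)
    return True
```

Because the dedup key is version-agnostic, rule rollouts never stop ingestion and never double-page for the same evaluation window.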