Context
Meta’s internal observability teams need a deployment pipeline for a metrics ingestion system that feeds operational dashboards used by service owners. Today, batch and streaming job releases are coordinated by hand across the Airflow, Spark, and warehouse layers, which leads to schema drift, duplicate loads, and slow rollbacks during production incidents.
Design a deployment-aware data pipeline architecture from the perspective of a DevOps engineer supporting Meta-scale telemetry ingestion. Focus on how deployment concepts (versioning, rollback, blue/green or canary rollout, idempotent reprocessing, dependency management, and observability) apply to ETL and stream-processing systems rather than to stateless web services.
Scale Requirements
- Ingress: 1.2M telemetry events/sec peak from services and hosts
- Batch backfill: up to 40 TB/day historical replay
- Latency: P95 under 2 minutes from ingestion to queryable aggregates
- Storage: 8 PB retained raw data, 180-day hot query window
- Availability: 99.95% for production dashboards
Requirements
- Design a deployment pipeline for streaming and batch jobs using Apache Airflow 2.x, Apache Spark Structured Streaming 3.x, Apache Kafka 3.x, and Presto over a Hive-compatible lake.
- Support safe rollout of schema changes and transformation logic without breaking downstream consumers.
- Ensure idempotent processing for retries, replay, and backfills (see the streaming sketch after this list).
- Define promotion stages: dev, staging, canary, and production, with automated validation gates.
- Include orchestration for dependency ordering across Kafka topics, Spark jobs, Airflow DAGs, and dbt-like SQL transforms.
- Specify monitoring, alerting, rollback, and disaster recovery procedures.
- Include one example of stream-processing deployment logic and one orchestration/config snippet; illustrative sketches of both follow this list.
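Below is a minimal sketch of the stream-processing deployment logic. Broker addresses, topic and lake paths, and the JOB_VERSION / SCHEMA_VERSION environment variables are all hypothetical placeholders; the deployment ideas it illustrates are a checkpoint location keyed to the job version and watermarked deduplication so that Kafka replays and backfill reruns stay idempotent.

```python
"""Versioned, idempotent Spark Structured Streaming deployment sketch.

Assumptions (not from the brief): broker addresses, topic and lake paths,
and the JOB_VERSION / SCHEMA_VERSION environment variables are hypothetical.
"""
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Pinned at deploy time. The checkpoint path embeds the job version, so each
# release runs against its own state and a blue/green cutover or rollback
# re-attaches to the matching checkpoint instead of corrupting shared state.
JOB_VERSION = os.environ.get("JOB_VERSION", "v42")          # hypothetical
SCHEMA_VERSION = os.environ.get("SCHEMA_VERSION", "s7")     # hypothetical
CHECKPOINT = f"s3://telemetry-ckpt/agg/{JOB_VERSION}"       # hypothetical path

spark = SparkSession.builder.appName(f"telemetry-agg-{JOB_VERSION}").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),  # unique per event, used for dedup
    StructField("service", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", LongType()),          # epoch millis
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")        # hypothetical
    .option("subscribe", "telemetry.events.v1")             # hypothetical topic
    .load()
)

parsed = (
    events.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp"))
    # The watermark bounds dedup state; deduplicating on (event_id, event_time)
    # makes Kafka replays and backfill reruns idempotent within the window.
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
)

aggregates = (
    parsed.groupBy(F.window("event_time", "1 minute"), "service", "metric")
    .agg(F.avg("value").alias("avg_value"), F.count("*").alias("event_count"))
    # Stamp every output row with the schema version for audit and lineage.
    .withColumn("schema_version", F.lit(SCHEMA_VERSION))
)

query = (
    aggregates.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://telemetry-lake/agg_1m/")          # hypothetical sink
    .option("checkpointLocation", CHECKPOINT)
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```

Forking the checkpoint per version is a deliberate trade-off: it lets a new version run in shadow (blue/green) against its own state, but a cutover replays recent data, which is why the dedup step above matters. Reusing one checkpoint across compatible releases is the simpler alternative when state schemas do not change.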
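For the orchestration/config snippet, here is a sketch of an Airflow 2.x promotion DAG with a canary validation gate. Task bodies are hypothetical stand-ins for the team's actual rollout surface (e.g., Tupperware) and metrics source (e.g., Scuba); the `schedule` argument assumes Airflow 2.4+.

```python
"""Airflow 2.x promotion DAG sketch: canary deploy -> validation gate ->
production promotion. All task bodies are hypothetical placeholders."""
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False,
     tags=["telemetry", "deploy"])
def telemetry_agg_promote():

    @task
    def deploy_canary(job_version: str) -> str:
        # Hypothetical: roll the new Spark job container to the canary
        # shard only (e.g., via a Tupperware rollout call).
        print(f"rolling {job_version} to canary shard")
        return job_version

    @task
    def validate_canary(job_version: str) -> str:
        # Hypothetical gate: compare canary output row counts and error
        # rates against production over a few windows (e.g., a Scuba
        # query) before allowing promotion. Failing the task blocks it.
        canary_healthy = True  # replace with a real metrics check
        if not canary_healthy:
            raise AirflowFailException(f"{job_version} failed canary validation")
        return job_version

    @task
    def promote_to_prod(job_version: str) -> None:
        # Hypothetical: shift the remaining shards to the new version while
        # the old one stays warm, so rollback is a traffic flip, not a
        # redeploy.
        print(f"promoting {job_version} to all production shards")

    promote_to_prod(validate_canary(deploy_canary("v42")))


telemetry_agg_promote()
```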
Constraints
- Prefer Meta-specific deployment surfaces where appropriate, such as Tupperware for containerized job rollout and Scuba for operational monitoring.
- No full pipeline downtime during deployment.
- Compliance requires auditability of code version, schema version, and data lineage for every production run (a run-manifest sketch follows this list).
- Team size is small: 3 DevOps engineers supporting 20+ pipelines, so operational simplicity matters.
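Given the three-person team, the auditability constraint is easiest to meet with one convention applied to every run. A minimal sketch, assuming a per-run JSON manifest; all field names and the destination path are hypothetical, and in practice the record might land in Scuba or a lineage service rather than a file:

```python
"""Per-run audit manifest sketch; a hypothetical illustration of the
compliance constraint, not an established convention from the brief."""
import json
import time


def write_run_manifest(run_id: str, job_version: str, schema_version: str,
                       input_paths: list[str], output_path: str) -> dict:
    # One manifest per production run: enough to answer "which code and
    # schema produced this partition, and from which inputs?"
    manifest = {
        "run_id": run_id,
        "job_version": job_version,        # e.g., git SHA or release tag
        "schema_version": schema_version,  # e.g., schema-registry id
        "inputs": input_paths,             # upstream lineage
        "output": output_path,
        "completed_at_epoch_s": int(time.time()),
    }
    with open(f"/var/audit/{run_id}.json", "w") as f:  # hypothetical path
        json.dump(manifest, f)
    return manifest
```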