Context
Meta’s internal observability teams need a deployment pipeline for a metrics ingestion system that feeds operational dashboards used by service owners. Today, batch and streaming job releases are coordinated by hand across the Airflow, Spark, and warehouse layers, which leads to schema drift, duplicate loads, and slow rollbacks during production incidents.
Design a deployment-aware data pipeline architecture from the perspective of a DevOps engineer supporting Meta-scale telemetry ingestion. Focus on how deployment concepts (versioning, rollback, blue/green or canary rollout, idempotent reprocessing, dependency management, and observability) apply to ETL and stream-processing systems rather than to stateless web services.
Scale Requirements
- Ingress: 1.2M telemetry events/sec peak from services and hosts
- Batch backfill: up to 40 TB/day historical replay
- Latency: P95 under 2 minutes from ingestion to queryable aggregates
- Storage: 8 PB retained raw data, 180-day hot query window
- Availability: 99.95% for production dashboards
Requirements
- Design a deployment pipeline for streaming and batch jobs using Apache Airflow 2.x, Apache Spark Structured Streaming 3.x, Apache Kafka 3.x, and Presto over a Hive-compatible lake.
- Support safe rollout of schema changes and transformation logic without breaking downstream consumers.
- Ensure idempotent processing for retries, replay, and backfills (see the streaming sketch after this list).
- Define promotion stages: dev, staging, canary, and production, with automated validation gates.
- Include orchestration for dependency ordering across Kafka topics, Spark jobs, Airflow DAGs, and dbt-like SQL transforms.
- Specify monitoring, alerting, rollback, and disaster recovery procedures.
- Include one example of stream-processing deployment logic and one orchestration/config snippet; illustrative sketches of both follow this list.
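Below is a minimal sketch of the stream-processing deployment logic. Broker addresses, topic and lake paths, and the JOB_VERSION / SCHEMA_VERSION environment variables are all hypothetical placeholders; the deployment ideas it illustrates are a checkpoint location keyed to the job version and watermarked deduplication so that Kafka replays and backfill reruns stay idempotent.

```python
"""Versioned, idempotent Spark Structured Streaming deployment sketch.

Assumptions (not from the brief): broker addresses, topic and lake paths,
and the JOB_VERSION / SCHEMA_VERSION environment variables are hypothetical.
"""
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Pinned at deploy time. The checkpoint path embeds the job version, so each
# release runs against its own state and a blue/green cutover or rollback
# re-attaches to the matching checkpoint instead of corrupting shared state.
JOB_VERSION = os.environ.get("JOB_VERSION", "v42")          # hypothetical
SCHEMA_VERSION = os.environ.get("SCHEMA_VERSION", "s7")     # hypothetical
CHECKPOINT = f"s3://telemetry-ckpt/agg/{JOB_VERSION}"       # hypothetical path

spark = SparkSession.builder.appName(f"telemetry-agg-{JOB_VERSION}").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),  # unique per event, used for dedup
    StructField("service", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", LongType()),          # epoch millis
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")        # hypothetical
    .option("subscribe", "telemetry.events.v1")             # hypothetical topic
    .load()
)

parsed = (
    events.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp"))
    # The watermark bounds dedup state; deduplicating on (event_id, event_time)
    # makes Kafka replays and backfill reruns idempotent within the window.
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
)

aggregates = (
    parsed.groupBy(F.window("event_time", "1 minute"), "service", "metric")
    .agg(F.avg("value").alias("avg_value"), F.count("*").alias("event_count"))
    # Stamp every output row with the schema version for audit and lineage.
    .withColumn("schema_version", F.lit(SCHEMA_VERSION))
)

query = (
    aggregates.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://telemetry-lake/agg_1m/")          # hypothetical sink
    .option("checkpointLocation", CHECKPOINT)
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```

Forking the checkpoint per version is a deliberate trade-off: it lets a new version run in shadow (blue/green) against its own state, but a cutover replays recent data, which is why the dedup step above matters. Reusing one checkpoint across compatible releases is the simpler alternative when state schemas do not change.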
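For the orchestration/config snippet, here is a sketch of an Airflow 2.x promotion DAG with a canary validation gate. Task bodies are hypothetical stand-ins for the team's actual rollout surface (e.g., Tupperware) and metrics source (e.g., Scuba); the `schedule` argument assumes Airflow 2.4+.

```python
"""Airflow 2.x promotion DAG sketch: canary deploy -> validation gate ->
production promotion. All task bodies are hypothetical placeholders."""
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False,
     tags=["telemetry", "deploy"])
def telemetry_agg_promote():

    @task
    def deploy_canary(job_version: str) -> str:
        # Hypothetical: roll the new Spark job container to the canary
        # shard only (e.g., via a Tupperware rollout call).
        print(f"rolling {job_version} to canary shard")
        return job_version

    @task
    def validate_canary(job_version: str) -> str:
        # Hypothetical gate: compare canary output row counts and error
        # rates against production over a few windows (e.g., a Scuba
        # query) before allowing promotion. Failing the task blocks it.
        canary_healthy = True  # replace with a real metrics check
        if not canary_healthy:
            raise AirflowFailException(f"{job_version} failed canary validation")
        return job_version

    @task
    def promote_to_prod(job_version: str) -> None:
        # Hypothetical: shift the remaining shards to the new version while
        # the old one stays warm, so rollback is a traffic flip, not a
        # redeploy.
        print(f"promoting {job_version} to all production shards")

    promote_to_prod(validate_canary(deploy_canary("v42")))


telemetry_agg_promote()
```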
Constraints
- Prefer Meta-specific deployment surfaces where appropriate, such as Tupperware for containerized job rollout and Scuba for operational monitoring.
- No full pipeline downtime during deployment.
- Compliance requires auditability of code version, schema version, and data lineage for every production run (a run-manifest sketch follows this list).
- Team size is small: 3 DevOps engineers supporting 20+ pipelines, so operational simplicity matters.
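Given the three-person team, the auditability constraint is easiest to meet with one convention applied to every run. A minimal sketch, assuming a per-run JSON manifest; all field names and the destination path are hypothetical, and in practice the record might land in Scuba or a lineage service rather than a file:

```python
"""Per-run audit manifest sketch; a hypothetical illustration of the
compliance constraint, not an established convention from the brief."""
import json
import time


def write_run_manifest(run_id: str, job_version: str, schema_version: str,
                       input_paths: list[str], output_path: str) -> dict:
    # One manifest per production run: enough to answer "which code and
    # schema produced this partition, and from which inputs?"
    manifest = {
        "run_id": run_id,
        "job_version": job_version,        # e.g., git SHA or release tag
        "schema_version": schema_version,  # e.g., schema-registry id
        "inputs": input_paths,             # upstream lineage
        "output": output_path,
        "completed_at_epoch_s": int(time.time()),
    }
    with open(f"/var/audit/{run_id}.json", "w") as f:  # hypothetical path
        json.dump(manifest, f)
    return manifest
```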