Context
Meta’s Ads Insights and internal analytics surfaces depend on event and dimension pipelines that land data from mobile/web logging, backend services, and partner feeds into a warehouse used by analysts and downstream ML systems. Today, several pipelines are marked “green” when jobs succeed, but consumers still see stale data because freshness is not measured end-to-end across ingestion, transformation, and serving.
Design a freshness monitoring framework for a mixed batch + streaming pipeline running on Apache Kafka, Apache Spark Structured Streaming, Apache Airflow, Hive/Presto, and Meta's internal observability stack (Scuba and ODS dashboards). The goal is to detect stale datasets before product teams notice broken dashboards or delayed model features.
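One way to make "green but stale" concrete is to define freshness end-to-end: wall-clock time minus the newest event time that downstream consumers can actually query at the serving layer, independent of job success. A minimal sketch of that definition (all names and values are illustrative, not from any Meta system):

```python
from datetime import datetime, timezone

def freshness_lag_seconds(newest_visible_event: datetime, now: datetime) -> float:
    """End-to-end freshness lag: wall-clock time minus the newest event
    time that downstream consumers can actually query. A job can report
    success while this number keeps growing."""
    return (now - newest_visible_event).total_seconds()

# Illustrative values: the serving layer's newest queryable event is
# 7 minutes old, so a streaming table with a 5-minute SLO is in breach
# even if every job in the pipeline reported success.
now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
newest = datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc)
lag = freshness_lag_seconds(newest, now)  # 420.0 seconds
```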
Scale Requirements
- Sources: 120 upstream datasets and 15 Kafka topics
- Volume: 2.5M events/sec peak streaming, 180 TB/day batch ingest
- Pipelines: 400 Airflow DAG runs/day, 60 streaming jobs
- Freshness SLOs: P95 under 5 minutes for streaming tables; under 90 minutes for hourly batch tables
- Retention: 30 days of freshness metrics, 1 year of incident audit logs
Requirements
- Define how freshness is measured at each stage: source emission, Kafka ingestion, Spark processing, Hive table publish, and downstream serving.
- Design metadata capture for event time, ingestion time, processing completion time, and publish time.
- Support both streaming lag and batch lateness detection with dataset-specific SLOs.
- Distinguish upstream data delay from pipeline failure, schema breakage, and low-volume periods.
- Provide alerting, dashboards, and on-call workflows with actionable root-cause signals.
- Support backfills without triggering false stale-data alerts.
- Ensure freshness checks are idempotent and do not materially increase pipeline latency.
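The metadata-capture and detection requirements above can be sketched as a per-dataset check that carries all four stage timestamps and uses them to separate upstream delay, low-volume periods, and declared backfills from genuine pipeline staleness. This is an illustrative sketch under assumed names and thresholds, not a definitive implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    FRESH = "fresh"
    STALE = "stale"                    # pipeline itself is behind: page on-call
    UPSTREAM_DELAY = "upstream_delay"  # source stopped emitting: route to producer team
    LOW_VOLUME = "low_volume"          # quiet period, no alert
    BACKFILL = "backfill"              # declared backfill in progress, suppress alert

@dataclass
class StageTimestamps:
    """Epoch seconds captured at each stage of the pipeline."""
    event_time: float    # source emission
    ingest_time: float   # landed in Kafka
    process_time: float  # Spark processing completed
    publish_time: float  # Hive table partition published

@dataclass
class DatasetSLO:
    name: str
    freshness_slo_s: float    # e.g. 300 for streaming tables, 5400 for hourly batch
    min_expected_rate: float  # events/sec below which missing data may be benign

def classify(ts: StageTimestamps, slo: DatasetSLO, now: float,
             observed_rate: float, backfill_in_progress: bool) -> Status:
    lag = now - ts.publish_time
    if lag <= slo.freshness_slo_s:
        return Status.FRESH
    if backfill_in_progress:
        # A registered backfill rewrites history; its old event times
        # must not fire stale-data alerts.
        return Status.BACKFILL
    # Attribute the lag: if the gap opened before ingestion, the source
    # itself is delayed and the pipeline is not at fault.
    source_lag = ts.ingest_time - ts.event_time
    if source_lag > slo.freshness_slo_s:
        return Status.UPSTREAM_DELAY
    if observed_rate < slo.min_expected_rate:
        return Status.LOW_VOLUME
    return Status.STALE
```

Because the check is a pure function of captured timestamps, re-running it for the same interval yields the same verdict, which keeps the monitoring path idempotent.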
Constraints
- Existing storage is primarily Hive-backed tables queried through Presto
- Teams cannot rewrite all producers; freshness instrumentation must be incremental
- Monitoring cost should stay under 5% additional compute/storage overhead
- Some datasets contain user data and must comply with Meta privacy and retention controls
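Since producers cannot all be rewritten, one incremental path is to infer publish freshness from what has already landed: periodically list a table's partitions through Presto and parse the newest one from Hive-style partition names, touching no producer code. A hedged sketch of the parsing step, assuming a `ds=YYYY-MM-DD/hour=HH` partition layout (the layout and function name are assumptions for illustration):

```python
from datetime import datetime, timezone

def newest_partition_ts(partitions: list[str]) -> datetime:
    """Infer publish freshness from Hive-style partition names (e.g. the
    output of a SHOW PARTITIONS query run through Presto), so freshness
    can be monitored without instrumenting the producer at all."""
    def parse(partition: str) -> datetime:
        fields = dict(field.split("=") for field in partition.split("/"))
        return datetime.strptime(
            f"{fields['ds']} {fields.get('hour', '00')}", "%Y-%m-%d %H"
        ).replace(tzinfo=timezone.utc)
    return max(parse(p) for p in partitions)
```

Partition listings come from the metastore rather than a table scan, so polling them stays well within a small compute-overhead budget, and the probe reads only partition names, never user data.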