Context
Meta’s Ads Insights and internal analytics surfaces depend on event and dimension pipelines that land data from mobile/web logging, backend services, and partner feeds into a warehouse used by analysts and downstream ML systems. Today, several pipelines are marked “green” when jobs succeed, but consumers still see stale data because freshness is not measured end-to-end across ingestion, transformation, and serving.
Design a freshness monitoring framework for a mixed batch + streaming pipeline running on Apache Kafka, Apache Spark Structured Streaming, Apache Airflow, Hive/Presto, and Meta's internal observability stack (Scuba and ODS dashboards). The goal is to detect stale datasets before product teams notice broken dashboards or delayed model features.
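One way to make "green but stale" concrete is to define freshness end-to-end: wall-clock time minus the newest event time that downstream consumers can actually query at the serving layer, independent of job success. A minimal sketch of that definition (all names and values are illustrative, not from any Meta system):

```python
from datetime import datetime, timezone

def freshness_lag_seconds(newest_visible_event: datetime, now: datetime) -> float:
    """End-to-end freshness lag: wall-clock time minus the newest event
    time that downstream consumers can actually query. A job can report
    success while this number keeps growing."""
    return (now - newest_visible_event).total_seconds()

# Illustrative values: the serving layer's newest queryable event is
# 7 minutes old, so a streaming table with a 5-minute SLO is in breach
# even if every job in the pipeline reported success.
now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
newest = datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc)
lag = freshness_lag_seconds(newest, now)  # 420.0 seconds
```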
Scale Requirements
- Sources: 120 upstream datasets and 15 Kafka topics
- Volume: 2.5M events/sec peak streaming, 180 TB/day batch ingest
- Pipelines: 400 Airflow DAG runs/day, 60 streaming jobs
- Freshness SLOs: P95 under 5 minutes for streaming tables; under 90 minutes for hourly batch tables
- Retention: 30 days of freshness metrics, 1 year of incident audit logs
Requirements
- Define how freshness is measured at each stage: source emission, Kafka ingestion, Spark processing, Hive table publish, and downstream serving.
- Design metadata capture for event time, ingestion time, processing completion time, and publish time.
- Support both streaming lag and batch lateness detection with dataset-specific SLOs.
- Distinguish upstream data delay from pipeline failure, schema breakage, and low-volume periods.
- Provide alerting, dashboards, and on-call workflows with actionable root-cause signals.
- Support backfills without triggering false stale-data alerts.
- Ensure freshness checks are idempotent and do not materially increase pipeline latency.
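The metadata-capture and detection requirements above can be sketched as a per-dataset check that carries all four stage timestamps and uses them to separate upstream delay, low-volume periods, and declared backfills from genuine pipeline staleness. This is an illustrative sketch under assumed names and thresholds, not a definitive implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    FRESH = "fresh"
    STALE = "stale"                    # pipeline itself is behind: page on-call
    UPSTREAM_DELAY = "upstream_delay"  # source stopped emitting: route to producer team
    LOW_VOLUME = "low_volume"          # quiet period, no alert
    BACKFILL = "backfill"              # declared backfill in progress, suppress alert

@dataclass
class StageTimestamps:
    """Epoch seconds captured at each stage of the pipeline."""
    event_time: float    # source emission
    ingest_time: float   # landed in Kafka
    process_time: float  # Spark processing completed
    publish_time: float  # Hive table partition published

@dataclass
class DatasetSLO:
    name: str
    freshness_slo_s: float    # e.g. 300 for streaming tables, 5400 for hourly batch
    min_expected_rate: float  # events/sec below which missing data may be benign

def classify(ts: StageTimestamps, slo: DatasetSLO, now: float,
             observed_rate: float, backfill_in_progress: bool) -> Status:
    lag = now - ts.publish_time
    if lag <= slo.freshness_slo_s:
        return Status.FRESH
    if backfill_in_progress:
        # A registered backfill rewrites history; its old event times
        # must not fire stale-data alerts.
        return Status.BACKFILL
    # Attribute the lag: if the gap opened before ingestion, the source
    # itself is delayed and the pipeline is not at fault.
    source_lag = ts.ingest_time - ts.event_time
    if source_lag > slo.freshness_slo_s:
        return Status.UPSTREAM_DELAY
    if observed_rate < slo.min_expected_rate:
        return Status.LOW_VOLUME
    return Status.STALE
```

Because the check is a pure function of captured timestamps, re-running it for the same interval yields the same verdict, which keeps the monitoring path idempotent.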
Constraints
- Existing storage is primarily Hive-backed tables queried through Presto
- Teams cannot rewrite all producers; freshness instrumentation must be incremental
- Monitoring cost should stay under 5% additional compute/storage overhead
- Some datasets contain user data and must comply with Meta privacy and retention controls
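Since producers cannot all be rewritten, one incremental path is to infer publish freshness from what has already landed: periodically list a table's partitions through Presto and parse the newest one from Hive-style partition names, touching no producer code. A hedged sketch of the parsing step, assuming a `ds=YYYY-MM-DD/hour=HH` partition layout (the layout and function name are assumptions for illustration):

```python
from datetime import datetime, timezone

def newest_partition_ts(partitions: list[str]) -> datetime:
    """Infer publish freshness from Hive-style partition names (e.g. the
    output of a SHOW PARTITIONS query run through Presto), so freshness
    can be monitored without instrumenting the producer at all."""
    def parse(partition: str) -> datetime:
        fields = dict(field.split("=") for field in partition.split("/"))
        return datetime.strptime(
            f"{fields['ds']} {fields.get('hour', '00')}", "%Y-%m-%d %H"
        ).replace(tzinfo=timezone.utc)
    return max(parse(p) for p in partitions)
```

Partition listings come from the metastore rather than a table scan, so polling them stays well within a small compute-overhead budget, and the probe reads only partition names, never user data.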