Context
Meta’s analytics platform ingests client and server events from Facebook and Instagram surfaces into a near-real-time pipeline used for product dashboards and experiment reads. The current stream path publishes fast aggregates, but mobile offline behavior, network retries, and regional outages cause events to arrive minutes or hours late, leading to undercounted metrics and inconsistent downstream tables.
You are asked to design a late-data handling strategy for a streaming pipeline built on Apache Kafka, Apache Flink, HDFS, and Presto, orchestrated by Apache Airflow, while fitting into Meta-style warehouse patterns and operational expectations.
Scale Requirements
- Ingress: 2.5M events/sec peak, 800K avg
- Event size: 1.2 KB average Avro payload
- Daily volume: ~69B events/day, ~83 TB raw/day (see the cross-check after this list)
- Freshness target: P95 event-to-queryable latency under 3 minutes
- Late data profile: 92% within 5 minutes, 7% within 2 hours, 1% up to 7 days late
- Retention: 30 days raw stream replay, 180 days curated warehouse tables
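
As a quick cross-check of the volume figures under the stated average rate (rounded): 800K events/sec × 86,400 sec/day ≈ 69B events/day, and 69B events × 1.2 KB ≈ 83 TB/day raw. Peak-rate sizing (2.5M events/sec sustained) would instead bound the day at roughly 216B events.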
Requirements
- Design event-time processing with watermarks to produce correct windowed aggregates despite late arrivals (see the watermarking sketch after this list).
- Support upserts/retractions so downstream fact and metric tables can be corrected when late events arrive (upsert sketch below).
- Ensure idempotent handling of duplicates caused by client retries and producer replays (dedup sketch below).
- Route events arriving beyond the allowed lateness threshold into a backfill path and describe how they are merged (covered by the side output in the watermarking sketch below).
- Define monitoring for the lateness distribution, watermark lag, state growth, and data quality regressions (metrics sketch below).
- Explain how batch backfills and stream processing apply the same business logic and stay consistent (shared-logic sketch below).
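
The sketch below shows one way to meet the watermarking and late-routing requirements with the Flink DataStream API. It assumes a decoded `Event` type (shown as a minimal interface; the real Avro schema is richer) and illustrative helper names; the 5-minute out-of-orderness bound and 2-hour allowed lateness mirror the stated late-data profile and would be tuned from the observed distribution. Events that miss even the allowed lateness are tagged to a side output feeding the backfill path, for example a Kafka topic drained to HDFS and merged by an Airflow-orchestrated job.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataTopology {

    /** Assumed shape of the decoded event; the production Avro schema is richer. */
    public interface Event {
        long eventTimeMillis();   // client event time
        String dimensionKey();    // surface/product dimensions used for keying
        String eventId();         // client-assigned id, used later for dedup
    }

    /** Side output for events that miss even the 2-hour allowed lateness. */
    static final OutputTag<Event> VERY_LATE = new OutputTag<Event>("very-late") {};

    public static void build(DataStream<Event> events) {
        // Event-time semantics: bounded out-of-orderness covers the ~92% of events
        // arriving within 5 minutes; idleness keeps quiet partitions from
        // stalling the watermark.
        DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((e, ts) -> e.eventTimeMillis())
                        .withIdleness(Duration.ofMinutes(1)));

        SingleOutputStreamOperator<Long> aggregates = withTimestamps
                .keyBy(Event::dimensionKey)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Window state is retained for 2 hours past the watermark so the
                // ~7% of events in that band trigger corrective re-firings.
                .allowedLateness(Time.hours(2))
                // Anything later than watermark + allowed lateness is routed to
                // the backfill path instead of being dropped.
                .sideOutputLateData(VERY_LATE)
                .aggregate(new CountAggregate());

        // Backfill path: very-late events (up to 7 days) leave the streaming job
        // here, e.g. to a Kafka topic drained to HDFS and merged by Airflow.
        DataStream<Event> veryLate = aggregates.getSideOutput(VERY_LATE);
        // veryLate.sinkTo(...);    // sink choices elided in this sketch
        // aggregates.sinkTo(...);  // upsert sink keyed by (window, dimension)
    }

    /** Stand-in for the real metric logic: a per-key, per-window event count. */
    public static class CountAggregate implements AggregateFunction<Event, Long, Long> {
        @Override public Long createAccumulator()    { return 0L; }
        @Override public Long add(Event e, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc)    { return acc; }
        @Override public Long merge(Long a, Long b)  { return a + b; }
    }
}
```

In production the watermark strategy would typically be attached to the Kafka source itself, which lets Flink generate per-partition watermarks and behave better under partition skew.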
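To make late re-firings safe for downstream tables, each window firing can be emitted as an upsert keyed by (window start, dimension) rather than an append. The sketch below is one shape for that: a `ProcessWindowFunction` wrapping the `CountAggregate` from the previous sketch, plugged in as `.aggregate(new CountAggregate(), new EmitUpsertRow())`; `MetricRow` is an assumed output type, not an existing schema.

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * Turns each window firing (on-time or late, within allowed lateness) into an
 * upsert row. The natural key (windowStart, dimensionKey) stays stable across
 * re-firings, so a compacted Kafka topic or a warehouse MERGE keyed on it
 * replaces the earlier value instead of double counting.
 */
public class EmitUpsertRow
        extends ProcessWindowFunction<Long, EmitUpsertRow.MetricRow, String, TimeWindow> {

    @Override
    public void process(String dimensionKey, Context ctx,
                        Iterable<Long> preAggregated, Collector<MetricRow> out) {
        long count = preAggregated.iterator().next(); // single value from the AggregateFunction
        MetricRow row = new MetricRow();
        row.windowStart = ctx.window().getStart();
        row.dimensionKey = dimensionKey;
        row.eventCount = count;
        // Monotonic version lets downstream consumers keep only the newest
        // correction if rows themselves arrive out of order.
        row.version = ctx.currentProcessingTime();
        out.collect(row);
    }

    /** Assumed output schema for the metric/fact table upsert. */
    public static class MetricRow {
        public long windowStart;
        public String dimensionKey;
        public long eventCount;
        public long version;
    }
}
```

A true retraction stream (re-emitting the previous value with a negative sign) is only needed when consumers aggregate additively over the rows; for key-value consumers, the versioned upsert above is sufficient.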
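A hedged sketch of the idempotency step, keyed on the client-assigned event id (the `eventId()` field assumed earlier). A small piece of keyed state records whether an id has been seen, and state TTL keeps it bounded; it would sit between decoding and the windowed aggregation, e.g. `withTimestamps.keyBy(Event::eventId).process(new DedupByEventId())`.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Drops duplicate events caused by client retries and producer replays.
 * Keyed by the assumed client-assigned event id; "seen" state expires after
 * 7 days (the maximum observed lateness) so state stays bounded.
 */
public class DedupByEventId extends KeyedProcessFunction<
        String, LateDataTopology.Event, LateDataTopology.Event> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Boolean.class);
        // TTL bounds state growth; expired entries are lazily cleaned up and
        // compacted out of the RocksDB backend.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .cleanupInRocksdbCompactFilter(1000)
                .build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(LateDataTopology.Event event, Context ctx,
                               Collector<LateDataTopology.Event> out) throws Exception {
        if (seen.value() == null) {   // first time this event id is observed
            seen.update(true);
            out.collect(event);
        }                             // else: duplicate, silently dropped
    }
}
```

At this event rate the dedup state is the main cost driver; a shorter dedup horizon (for example 48 hours), with the batch merge deduplicating the 7-day tail, is usually the better trade-off against the 20% cost ceiling.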
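For the monitoring requirement, a pass-through operator placed right after watermark assignment can export lateness and watermark-lag signals through Flink's metric groups (scraped by whatever reporter is already configured); the metric names and thresholds below are illustrative, and the lag gauge reads artificially high until the first watermark arrives. State growth is better read from Flink's built-in state and checkpoint-size metrics, and data-quality regressions from warehouse-side checks, so neither is shown here.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Pass-through operator exporting the lateness distribution and watermark lag.
 * Thresholds mirror the stated late-data profile.
 */
public class LatenessMetrics
        extends ProcessFunction<LateDataTopology.Event, LateDataTopology.Event> {

    private transient Counter onTime;        // lateness <= 5 minutes
    private transient Counter lateWithin2h;  // 5 minutes .. 2 hours
    private transient Counter veryLate;      // > 2 hours (backfill candidates)
    private volatile long watermarkLagMs;

    @Override
    public void open(Configuration parameters) {
        MetricGroup metrics = getRuntimeContext().getMetricGroup();
        onTime = metrics.counter("events_on_time");
        lateWithin2h = metrics.counter("events_late_within_2h");
        veryLate = metrics.counter("events_very_late");
        // Sampled by the configured reporter; alert when lag exceeds the
        // 3-minute freshness budget for a sustained period.
        metrics.gauge("watermark_lag_ms", (Gauge<Long>) () -> watermarkLagMs);
    }

    @Override
    public void processElement(LateDataTopology.Event event, Context ctx,
                               Collector<LateDataTopology.Event> out) {
        long now = ctx.timerService().currentProcessingTime();
        watermarkLagMs = now - ctx.timerService().currentWatermark();

        long latenessMs = now - event.eventTimeMillis();
        if (latenessMs <= 5 * 60_000L) {
            onTime.inc();
        } else if (latenessMs <= 2 * 3_600_000L) {
            lateWithin2h.inc();
        } else {
            veryLate.inc();
        }
        out.collect(event);
    }
}
```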
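One way to keep batch backfills and streaming consistent is to put the metric logic in a single pure, version-pinned class that both paths call; the names below are illustrative. The `CountAggregate` in the streaming sketch would delegate to it, and the Airflow-triggered backfill job (running over the HDFS copy of the very-late side output plus the affected partitions) calls the same `add`/`merge` before issuing the warehouse MERGE, so corrected partitions match what the stream would have produced.

```java
/**
 * Shared business logic, deployed with both the streaming job and the batch
 * backfill so the two paths cannot drift. Illustrative only; the real
 * accumulator would carry the full metric set.
 */
public final class MetricLogic {

    /** Accumulator used by stream windows and batch partitions alike. */
    public static final class Acc {
        public long eventCount;
    }

    public static Acc newAcc() {
        return new Acc();
    }

    /** Fold one event: called per element in streaming, per row in batch. */
    public static Acc add(Acc acc) {
        acc.eventCount++;
        return acc;
    }

    /** Merge partial results: window merging in streaming, combiners in batch. */
    public static Acc merge(Acc a, Acc b) {
        Acc out = new Acc();
        out.eventCount = a.eventCount + b.eventCount;
        return out;
    }

    private MetricLogic() {}
}
```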
Constraints
- Do not increase median query latency for dashboards beyond current SLA.
- Stateful operators must remain bounded and recoverable during regional failover (see the checkpointing sketch after this list).
- Assume schema evolution is frequent across mobile clients (see the Avro decoding sketch after this list).
- Cost increase should stay under 20% versus the current streaming footprint.
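
On the failover constraint, the sketch below shows the kind of job-level recovery configuration implied: keyed state (windows plus dedup) held in RocksDB with incremental, exactly-once checkpoints stored in HDFS, so a standby region can restore from the latest checkpoint or savepoint. Paths and intervals are placeholders; the state TTL in the dedup sketch is the other half of keeping state bounded.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Recovery-related job setup: large keyed state in RocksDB (off the JVM heap),
 * incremental exactly-once checkpoints in HDFS for cross-region restore.
 */
public final class CheckpointSetup {

    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints keep the per-checkpoint upload small even as
        // window and dedup state grows.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // every 1 minute
        env.getCheckpointConfig()
                .setCheckpointStorage("hdfs:///flink/checkpoints/late-data-job"); // placeholder path
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
        return env;
    }

    private CheckpointSetup() {}
}
```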
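For frequent mobile schema evolution, one workable pattern is to decode every payload with the writer schema the client actually used and resolve it against a pinned reader schema, so newly added optional fields and removed fields break neither the job nor the curated tables. The registry lookup that maps an embedded schema id to the writer schema is assumed and elided; the class and method names are illustrative.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

/**
 * Decodes an Avro payload written by any client schema version and resolves it
 * to the pipeline's pinned reader schema via standard Avro schema resolution.
 */
public final class EvolvingAvroDecoder {

    private final Schema readerSchema; // schema the Flink job and warehouse expect

    public EvolvingAvroDecoder(Schema readerSchema) {
        this.readerSchema = readerSchema;
    }

    public GenericRecord decode(byte[] payload, Schema writerSchema) throws IOException {
        // Schema resolution maps writer fields onto the reader schema and fills
        // in defaults for fields the old client version did not send.
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}
```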