Context
Meta’s analytics platform ingests client and server events from Facebook and Instagram surfaces into a near-real-time pipeline used for product dashboards and experiment reads. The current stream path publishes fast aggregates, but mobile offline behavior, network retries, and regional outages cause events to arrive minutes or hours late, leading to undercounted metrics and inconsistent downstream tables.
You are asked to design a late-data handling strategy for a streaming pipeline built on Apache Kafka, Apache Flink, HDFS, and Presto, orchestrated by Apache Airflow, while fitting into Meta-style warehouse patterns and operational expectations.
Scale Requirements
- Ingress: 2.5M events/sec peak, 800K avg
- Event size: 1.2 KB average Avro payload
- Daily volume: ~69B events/day, ~83 TB raw/day (see the cross-check after this list)
- Freshness target: P95 event-to-queryable latency under 3 minutes
- Late data profile: 92% within 5 minutes, 7% within 2 hours, 1% up to 7 days late
- Retention: 30 days raw stream replay, 180 days curated warehouse tables
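
As a quick cross-check of the volume figures under the stated average rate (rounded): 800K events/sec × 86,400 sec/day ≈ 69B events/day, and 69B events × 1.2 KB ≈ 83 TB/day raw. Peak-rate sizing (2.5M events/sec sustained) would instead bound the day at roughly 216B events.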
Requirements
- Design event-time processing with watermarks to produce correct windowed aggregates despite late arrivals (see the watermarking sketch after this list).
- Support upserts/retractions so downstream fact and metric tables can be corrected when late events arrive (upsert sketch below).
- Ensure idempotent handling of duplicates caused by client retries and producer replays (dedup sketch below).
- Route events arriving beyond the allowed lateness threshold into a backfill path and describe how they are merged (covered by the side output in the watermarking sketch below).
- Define monitoring for the lateness distribution, watermark lag, state growth, and data quality regressions (metrics sketch below).
- Explain how batch backfills and stream processing apply the same business logic and stay consistent (shared-logic sketch below).
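
The sketch below shows one way to meet the watermarking and late-routing requirements with the Flink DataStream API. It assumes a decoded `Event` type (shown as a minimal interface; the real Avro schema is richer) and illustrative helper names; the 5-minute out-of-orderness bound and 2-hour allowed lateness mirror the stated late-data profile and would be tuned from the observed distribution. Events that miss even the allowed lateness are tagged to a side output feeding the backfill path, for example a Kafka topic drained to HDFS and merged by an Airflow-orchestrated job.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataTopology {

    /** Assumed shape of the decoded event; the production Avro schema is richer. */
    public interface Event {
        long eventTimeMillis();   // client event time
        String dimensionKey();    // surface/product dimensions used for keying
        String eventId();         // client-assigned id, used later for dedup
    }

    /** Side output for events that miss even the 2-hour allowed lateness. */
    static final OutputTag<Event> VERY_LATE = new OutputTag<Event>("very-late") {};

    public static void build(DataStream<Event> events) {
        // Event-time semantics: bounded out-of-orderness covers the ~92% of events
        // arriving within 5 minutes; idleness keeps quiet partitions from
        // stalling the watermark.
        DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((e, ts) -> e.eventTimeMillis())
                        .withIdleness(Duration.ofMinutes(1)));

        SingleOutputStreamOperator<Long> aggregates = withTimestamps
                .keyBy(Event::dimensionKey)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Window state is retained for 2 hours past the watermark so the
                // ~7% of events in that band trigger corrective re-firings.
                .allowedLateness(Time.hours(2))
                // Anything later than watermark + allowed lateness is routed to
                // the backfill path instead of being dropped.
                .sideOutputLateData(VERY_LATE)
                .aggregate(new CountAggregate());

        // Backfill path: very-late events (up to 7 days) leave the streaming job
        // here, e.g. to a Kafka topic drained to HDFS and merged by Airflow.
        DataStream<Event> veryLate = aggregates.getSideOutput(VERY_LATE);
        // veryLate.sinkTo(...);    // sink choices elided in this sketch
        // aggregates.sinkTo(...);  // upsert sink keyed by (window, dimension)
    }

    /** Stand-in for the real metric logic: a per-key, per-window event count. */
    public static class CountAggregate implements AggregateFunction<Event, Long, Long> {
        @Override public Long createAccumulator()    { return 0L; }
        @Override public Long add(Event e, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc)    { return acc; }
        @Override public Long merge(Long a, Long b)  { return a + b; }
    }
}
```

In production the watermark strategy would typically be attached to the Kafka source itself, which lets Flink generate per-partition watermarks and behave better under partition skew.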
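To make late re-firings safe for downstream tables, each window firing can be emitted as an upsert keyed by (window start, dimension) rather than an append. The sketch below is one shape for that: a `ProcessWindowFunction` wrapping the `CountAggregate` from the previous sketch, plugged in as `.aggregate(new CountAggregate(), new EmitUpsertRow())`; `MetricRow` is an assumed output type, not an existing schema.

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * Turns each window firing (on-time or late, within allowed lateness) into an
 * upsert row. The natural key (windowStart, dimensionKey) stays stable across
 * re-firings, so a compacted Kafka topic or a warehouse MERGE keyed on it
 * replaces the earlier value instead of double counting.
 */
public class EmitUpsertRow
        extends ProcessWindowFunction<Long, EmitUpsertRow.MetricRow, String, TimeWindow> {

    @Override
    public void process(String dimensionKey, Context ctx,
                        Iterable<Long> preAggregated, Collector<MetricRow> out) {
        long count = preAggregated.iterator().next(); // single value from the AggregateFunction
        MetricRow row = new MetricRow();
        row.windowStart = ctx.window().getStart();
        row.dimensionKey = dimensionKey;
        row.eventCount = count;
        // Monotonic version lets downstream consumers keep only the newest
        // correction if rows themselves arrive out of order.
        row.version = ctx.currentProcessingTime();
        out.collect(row);
    }

    /** Assumed output schema for the metric/fact table upsert. */
    public static class MetricRow {
        public long windowStart;
        public String dimensionKey;
        public long eventCount;
        public long version;
    }
}
```

A true retraction stream (re-emitting the previous value with a negative sign) is only needed when consumers aggregate additively over the rows; for key-value consumers, the versioned upsert above is sufficient.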
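A hedged sketch of the idempotency step, keyed on the client-assigned event id (the `eventId()` field assumed earlier). A small piece of keyed state records whether an id has been seen, and state TTL keeps it bounded; it would sit between decoding and the windowed aggregation, e.g. `withTimestamps.keyBy(Event::eventId).process(new DedupByEventId())`.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Drops duplicate events caused by client retries and producer replays.
 * Keyed by the assumed client-assigned event id; "seen" state expires after
 * 7 days (the maximum observed lateness) so state stays bounded.
 */
public class DedupByEventId extends KeyedProcessFunction<
        String, LateDataTopology.Event, LateDataTopology.Event> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Boolean.class);
        // TTL bounds state growth; expired entries are lazily cleaned up and
        // compacted out of the RocksDB backend.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .cleanupInRocksdbCompactFilter(1000)
                .build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(LateDataTopology.Event event, Context ctx,
                               Collector<LateDataTopology.Event> out) throws Exception {
        if (seen.value() == null) {   // first time this event id is observed
            seen.update(true);
            out.collect(event);
        }                             // else: duplicate, silently dropped
    }
}
```

At this event rate the dedup state is the main cost driver; a shorter dedup horizon (for example 48 hours), with the batch merge deduplicating the 7-day tail, is usually the better trade-off against the 20% cost ceiling.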
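For the monitoring requirement, a pass-through operator placed right after watermark assignment can export lateness and watermark-lag signals through Flink's metric groups (scraped by whatever reporter is already configured); the metric names and thresholds below are illustrative, and the lag gauge reads artificially high until the first watermark arrives. State growth is better read from Flink's built-in state and checkpoint-size metrics, and data-quality regressions from warehouse-side checks, so neither is shown here.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Pass-through operator exporting the lateness distribution and watermark lag.
 * Thresholds mirror the stated late-data profile.
 */
public class LatenessMetrics
        extends ProcessFunction<LateDataTopology.Event, LateDataTopology.Event> {

    private transient Counter onTime;        // lateness <= 5 minutes
    private transient Counter lateWithin2h;  // 5 minutes .. 2 hours
    private transient Counter veryLate;      // > 2 hours (backfill candidates)
    private volatile long watermarkLagMs;

    @Override
    public void open(Configuration parameters) {
        MetricGroup metrics = getRuntimeContext().getMetricGroup();
        onTime = metrics.counter("events_on_time");
        lateWithin2h = metrics.counter("events_late_within_2h");
        veryLate = metrics.counter("events_very_late");
        // Sampled by the configured reporter; alert when lag exceeds the
        // 3-minute freshness budget for a sustained period.
        metrics.gauge("watermark_lag_ms", (Gauge<Long>) () -> watermarkLagMs);
    }

    @Override
    public void processElement(LateDataTopology.Event event, Context ctx,
                               Collector<LateDataTopology.Event> out) {
        long now = ctx.timerService().currentProcessingTime();
        watermarkLagMs = now - ctx.timerService().currentWatermark();

        long latenessMs = now - event.eventTimeMillis();
        if (latenessMs <= 5 * 60_000L) {
            onTime.inc();
        } else if (latenessMs <= 2 * 3_600_000L) {
            lateWithin2h.inc();
        } else {
            veryLate.inc();
        }
        out.collect(event);
    }
}
```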
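One way to keep batch backfills and streaming consistent is to put the metric logic in a single pure, version-pinned class that both paths call; the names below are illustrative. The `CountAggregate` in the streaming sketch would delegate to it, and the Airflow-triggered backfill job (running over the HDFS copy of the very-late side output plus the affected partitions) calls the same `add`/`merge` before issuing the warehouse MERGE, so corrected partitions match what the stream would have produced.

```java
/**
 * Shared business logic, deployed with both the streaming job and the batch
 * backfill so the two paths cannot drift. Illustrative only; the real
 * accumulator would carry the full metric set.
 */
public final class MetricLogic {

    /** Accumulator used by stream windows and batch partitions alike. */
    public static final class Acc {
        public long eventCount;
    }

    public static Acc newAcc() {
        return new Acc();
    }

    /** Fold one event: called per element in streaming, per row in batch. */
    public static Acc add(Acc acc) {
        acc.eventCount++;
        return acc;
    }

    /** Merge partial results: window merging in streaming, combiners in batch. */
    public static Acc merge(Acc a, Acc b) {
        Acc out = new Acc();
        out.eventCount = a.eventCount + b.eventCount;
        return out;
    }

    private MetricLogic() {}
}
```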
Constraints
- Do not increase median query latency for dashboards beyond current SLA.
- Stateful operators must remain bounded and recoverable during regional failover (see the checkpointing sketch after this list).
- Assume schema evolution is frequent across mobile clients (see the Avro decoding sketch after this list).
- Cost increase should stay under 20% versus the current streaming footprint.
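
On the failover constraint, the sketch below shows the kind of job-level recovery configuration implied: keyed state (windows plus dedup) held in RocksDB with incremental, exactly-once checkpoints stored in HDFS, so a standby region can restore from the latest checkpoint or savepoint. Paths and intervals are placeholders; the state TTL in the dedup sketch is the other half of keeping state bounded.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Recovery-related job setup: large keyed state in RocksDB (off the JVM heap),
 * incremental exactly-once checkpoints in HDFS for cross-region restore.
 */
public final class CheckpointSetup {

    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints keep the per-checkpoint upload small even as
        // window and dedup state grows.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // every 1 minute
        env.getCheckpointConfig()
                .setCheckpointStorage("hdfs:///flink/checkpoints/late-data-job"); // placeholder path
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
        return env;
    }

    private CheckpointSetup() {}
}
```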
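For frequent mobile schema evolution, one workable pattern is to decode every payload with the writer schema the client actually used and resolve it against a pinned reader schema, so newly added optional fields and removed fields break neither the job nor the curated tables. The registry lookup that maps an embedded schema id to the writer schema is assumed and elided; the class and method names are illustrative.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

/**
 * Decodes an Avro payload written by any client schema version and resolves it
 * to the pipeline's pinned reader schema via standard Avro schema resolution.
 */
public final class EvolvingAvroDecoder {

    private final Schema readerSchema; // schema the Flink job and warehouse expect

    public EvolvingAvroDecoder(Schema readerSchema) {
        this.readerSchema = readerSchema;
    }

    public GenericRecord decode(byte[] payload, Schema writerSchema) throws IOException {
        // Schema resolution maps writer fields onto the reader schema and fills
        // in defaults for fields the old client version did not send.
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}
```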