Context
Meta’s Ads Insights team ingests delivery, click, and conversion events from multiple internal producers into a shared analytics pipeline used by Ads Manager and internal reporting. The current pipeline is vulnerable to duplicate processing during task retries, late backfills, and stream replays, causing metric inflation and inconsistent dashboards.
You need to design an idempotent pipeline so that rerunning a job, replaying a stream, or backfilling historical partitions does not change final results unless the source data itself changed.
Scale Requirements
- Throughput: 2M events/second peak across delivery, click, and conversion streams
- Batch volume: 25 TB/day raw logs, 180 billion rows/month
- Latency target: streaming aggregates available in < 3 minutes; batch corrections within 2 hours
- Retention: raw logs for 180 days, curated aggregates for 2 years
- Backfill scale: up to 30 days of replay without double counting
Requirements
- Design a pipeline that guarantees idempotent writes for both streaming and batch ingestion paths.
- Define how to generate and persist stable deduplication keys across retries, replays, and upstream resends.
- Support late-arriving events, partition reprocessing, and historical backfills without corrupting downstream aggregates.
- Produce daily and hourly Ads metrics tables consumable by Presto/Trino and internal dashboards.
- Explain how orchestration should safely retry failed tasks in Apache Airflow without duplicate side effects.
- Include data quality checks for duplicate rate, missing partitions, and reconciliation between raw and curated layers.
- Describe monitoring, alerting, and recovery procedures for partial loads and exactly-once failures.
Constraints
- Existing storage is HDFS-backed data lake plus Hive/Presto query surfaces; migration must be incremental.
- Upstream producers are not perfectly reliable and may resend the same event with the same
event_id.
- Budget does not allow full-table rewrites for every retry.
- Compliance requires auditability of every replay, correction, and delete request.