Context
Meta’s Ads Insights platform currently computes most advertiser reporting metrics in scheduled Hive/Spark batch jobs on data stored in HDFS-backed warehouses. Product teams now want fresher delivery and pacing metrics for surfaces such as Facebook Feed, Instagram, and Reels, but not every dataset justifies the operational cost of a streaming architecture.
Design a pipeline strategy for ad impression, click, conversion, and budget events, and explain when you would choose batch, streaming, or a hybrid Lambda/Kappa-style approach for different downstream consumers.
Scale Requirements
- Ingress: 8M events/sec peak globally across impression, click, spend, and conversion logs
- Event size: 1-3 KB Avro/JSON payloads
- Raw log volume: ~35 PB/month (roughly 1.2 PB/day on average)
- Freshness targets:
  - Budget pacing and anomaly detection: < 60 seconds
  - Internal operational dashboards: < 5 minutes
  - Advertiser billing and finance reconciliation: T+1 day, accuracy-first
- Retention: 180 days raw, 7 years for finance-grade aggregates
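The scale figures above can be cross-checked with some back-of-envelope arithmetic. The sketch below is illustrative only: it assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes) and a 30-day month, neither of which is stated in the requirements, and the function names are hypothetical.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions (not in the source): decimal units and a 30-day month.

PEAK_EVENTS_PER_SEC = 8_000_000
EVENT_SIZE_BYTES = (1_000, 3_000)   # 1-3 KB payloads
RAW_PB_PER_MONTH = 35
RAW_RETENTION_DAYS = 180
DAYS_PER_MONTH = 30


def peak_ingress_gb_per_sec():
    """Peak ingress bandwidth range (low, high) in GB/s."""
    return tuple(PEAK_EVENTS_PER_SEC * size / 1e9 for size in EVENT_SIZE_BYTES)


def average_ingress_gb_per_sec():
    """Average ingress implied by the monthly raw volume, in GB/s."""
    seconds_per_month = DAYS_PER_MONTH * 86_400
    return RAW_PB_PER_MONTH * 1e15 / seconds_per_month / 1e9


def raw_retention_pb():
    """Raw storage held at steady state for the retention window, in PB."""
    return RAW_PB_PER_MONTH * RAW_RETENTION_DAYS / DAYS_PER_MONTH


if __name__ == "__main__":
    low, high = peak_ingress_gb_per_sec()
    print(f"peak ingress: {low:.0f}-{high:.0f} GB/s")
    print(f"average ingress: {average_ingress_gb_per_sec():.1f} GB/s")
    print(f"raw retention footprint: {raw_retention_pb():.0f} PB")
```

Under these assumptions, peak ingress is 8 to 24 GB/s, the monthly volume implies an average of about 13.5 GB/s, and 180 days of raw retention is about 210 PB before compression or replication. Because the average cannot exceed the peak, the figures are only mutually consistent if the mean event size is above roughly 1.7 KB.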
Requirements
- Design separate paths for low-latency operational use cases and high-accuracy financial reporting.
- Define which datasets should remain batch, which should be streaming, and why.
- Support late-arriving conversion events, deduplication, replay, and backfills.
- Include schema evolution, idempotent writes, and data quality validation.
- Orchestrate dependencies between stream processors, warehouse loads, and daily reconciliation jobs.
- Expose analytics-ready tables for Presto/Trino-style querying and internal dashboards.
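As one way to picture the deduplication and late-arrival requirements, the following is a minimal in-memory sketch of event-time windowed aggregation. It is not a proposed implementation: the `Event` fields, the 60-second window, the 300-second allowed lateness, and the use of the maximum observed event time as a watermark are all assumptions chosen for illustration. A real stream processor would keep this state in a fault-tolerant store and expire old event IDs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str       # hypothetical globally unique ID used for dedup
    campaign_id: str
    event_time: int     # epoch seconds, assigned at the source
    spend_micros: int


@dataclass
class WindowedSpend:
    window_secs: int = 60              # assumed tumbling window size
    allowed_lateness_secs: int = 300   # assumed lateness bound
    watermark: int = 0                 # simplification: max event time seen
    seen_ids: set = field(default_factory=set)
    totals: dict = field(default_factory=lambda: defaultdict(int))
    too_late: list = field(default_factory=list)

    def process(self, ev: Event) -> None:
        # Deduplicate on event_id so retries and replays do not double count.
        if ev.event_id in self.seen_ids:
            return
        self.seen_ids.add(ev.event_id)

        self.watermark = max(self.watermark, ev.event_time)

        # Events older than the lateness bound are set aside for the batch
        # path (for example, the T+1 reconciliation job) instead of dropped.
        if ev.event_time < self.watermark - self.allowed_lateness_secs:
            self.too_late.append(ev)
            return

        window_start = ev.event_time - ev.event_time % self.window_secs
        self.totals[(ev.campaign_id, window_start)] += ev.spend_micros


if __name__ == "__main__":
    agg = WindowedSpend()
    agg.process(Event("a", "c1", 1200, 5))
    agg.process(Event("a", "c1", 1200, 5))   # duplicate, ignored
    agg.process(Event("b", "c1", 1230, 7))
    agg.process(Event("c", "c1", 1600, 1))   # advances the watermark
    agg.process(Event("d", "c1", 1250, 3))   # beyond lateness, set aside
    agg.process(Event("e", "c1", 1310, 2))   # late but within the bound
    print(dict(agg.totals), [e.event_id for e in agg.too_late])
```

The split between events accepted late and events set aside mirrors the two paths in the requirements: the streaming path tolerates bounded lateness for freshness, while anything older is left to the accuracy-first batch path.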
Constraints
- The existing Meta stack relies heavily on Scribe, Apache Kafka- or Pulsar-like logs, Spark, Presto, Hive, and workflow orchestration similar to Airflow.
- Finance datasets require exactly-once semantics at the aggregate level and auditable reprocessing.
- Incremental infrastructure cost should stay below $300K/month.
- Cross-region failover must keep the recovery point objective (RPO) at or below 5 minutes for streaming consumers.
- GDPR/CCPA deletion requests must propagate to both raw and derived stores.
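One common way to approach exactly-once semantics at the aggregate level is to make each write an idempotent partition overwrite and to record every run for audit. The sketch below illustrates that idea with an in-memory store; `AggregateStore`, the `ds` date key, and `run_id` are hypothetical names, and a real system would apply the same pattern to warehouse partitions rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class AggregateStore:
    # ds (date string) -> {campaign_id: spend_micros}
    partitions: dict = field(default_factory=dict)
    # Append-only record of (ds, run_id, row_count) for auditability.
    audit_log: list = field(default_factory=list)

    def overwrite_partition(self, ds: str, run_id: str, rows: dict) -> None:
        # Replace the whole partition instead of appending, so rerunning
        # or backfilling a day converges to the same final state.
        self.partitions[ds] = dict(rows)
        self.audit_log.append((ds, run_id, len(rows)))


if __name__ == "__main__":
    store = AggregateStore()
    day = {"c1": 12_000_000, "c2": 4_500_000}

    store.overwrite_partition("2024-01-01", "run-1", day)
    store.overwrite_partition("2024-01-01", "run-1-retry", day)  # safe retry

    # A backfill with corrected input replaces the partition wholesale.
    store.overwrite_partition("2024-01-01", "backfill-1", {"c1": 12_500_000})
    print(store.partitions["2024-01-01"], len(store.audit_log))
```

Retries leave the partition unchanged, a backfill replaces it atomically from the reader's point of view, and the audit log keeps a trace of every run that touched a given day.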