You are designing a real-time pipeline for an advertising platform that needs up-to-date click and impression aggregates for pacing, billing, and advertiser reporting. Today, event data lands in hourly files and downstream teams regularly see mismatches between delivery dashboards and finance reports. The gap has become an executive escalation because campaign managers need minute-level visibility during peak spend windows, while finance requires deduplicated, auditable counts. You need to support low-latency aggregation without losing the ability to reprocess historical data when schemas or attribution rules change.
| Component | Status / Technology |
|---|---|
| Event Producers | Mobile clients, web clients, ad delivery servers |
| Ingestion | Snap Pub/Sub edge collectors writing JSON events |
| Processing | Hourly Spark batch jobs |
| Storage | S3 data lake + Snowflake warehouse |
| Orchestration | Apache Airflow 2.x |
| Consumption | Internal dashboards, billing exports, campaign pacing services |
Scale: ~220K events/sec average, 900K peak during major campaigns, ~9B events/day, 1-2 KB per event, 15-minute freshness target for aggregates, 13-month retention, late events common within 2 hours.
How would you design the end-to-end real-time event processing pipeline so clicks and impressions are aggregated accurately and queryable with low latency, while still supporting replay, backfills, and reconciliation against downstream billing data?