Product Context
Design the ML-backed impression tracking and validation system for Meta ads shown across Facebook Feed, Instagram, Reels, and Audience Network. Ad delivery, billing, measurement, and integrity teams use the system to decide in real time whether an ad impression should be counted. Decisions must be low latency, and the handling of duplicates, delayed events, and partial outages must be explicit.
Scale
| Signal | Value |
|---|---|
| DAU across surfaces | 2.8B |
| Peak ad render events | 18M QPS |
| Peak billable impression decisions | 9M QPS |
| Ad creatives / active campaigns | 150M creatives / 25M active campaigns |
| Event retention | 90 days hot, 1 year cold |
| End-to-end p99 decision latency | 50ms |
| Availability target | 99.99% |
The system must support real-time counting for pacing and reporting, while also producing high-quality labels for downstream ML models that detect invalid traffic, duplicate impressions, and instrumentation bugs. Assume some feedback signals are delayed by minutes to hours, and some surfaces may emit incomplete client logs.
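As a rough illustration of the sizing arithmetic implied by the Scale table, the sketch below derives a daily event upper bound, the yearly downtime budget for 99.99% availability, and a hot-storage estimate. The per-event size and the average-to-peak ratio are assumptions that do not appear in the problem statement; replace them with measured values.

```python
# Figures taken from the Scale table.
PEAK_RENDER_QPS = 18_000_000
PEAK_DECISION_QPS = 9_000_000
AVAILABILITY = 0.9999
HOT_RETENTION_DAYS = 90

# ASSUMPTIONS, not given in the problem statement.
ASSUMED_BYTES_PER_EVENT = 500      # hypothetical serialized event size
ASSUMED_AVG_TO_PEAK_RATIO = 0.5    # hypothetical average load as a fraction of peak

SECONDS_PER_DAY = 86_400
MINUTES_PER_YEAR = 365 * 24 * 60


def daily_events_upper_bound(peak_qps: int) -> int:
    """Events per day if peak load were sustained for 24 hours (an upper bound)."""
    return peak_qps * SECONDS_PER_DAY


def downtime_budget_minutes_per_year(availability: float) -> float:
    """Minutes per year the system may be unavailable at the given target."""
    return (1 - availability) * MINUTES_PER_YEAR


def hot_storage_petabytes(
    peak_qps: int, bytes_per_event: int, avg_to_peak: float, days: int
) -> float:
    """Raw (unreplicated, uncompressed) hot storage in PB under the assumed inputs."""
    events = peak_qps * avg_to_peak * SECONDS_PER_DAY * days
    return events * bytes_per_event / 1e15


if __name__ == "__main__":
    print(daily_events_upper_bound(PEAK_RENDER_QPS))
    print(downtime_budget_minutes_per_year(AVAILABILITY))
    print(
        hot_storage_petabytes(
            PEAK_RENDER_QPS,
            ASSUMED_BYTES_PER_EVENT,
            ASSUMED_AVG_TO_PEAK_RATIO,
            HOT_RETENTION_DAYS,
        )
    )
```

Under these assumed inputs, sustained peak would mean about 1.56 trillion render events per day, the availability target allows roughly 52.6 minutes of downtime per year, and 90 days of hot render events come to about 35 PB before replication and compression.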
Deliverables
- Define the functional and non-functional requirements, including what qualifies as a valid impression versus a dropped, duplicate, or uncertain event.
- Estimate system scale, storage, and online serving capacity for peak traffic, feature lookups, and model inference.
- Propose an end-to-end architecture covering event ingestion, candidate validation, ML scoring, deduplication, aggregation, and downstream reporting.
- Choose models for each stage (fast filtering, risk scoring, optional re-evaluation) and explain online vs. batch decisions.
- Describe the training data pipeline, label construction, offline/online evaluation, and experimentation strategy.
- Identify major failure modes, especially feature drift, training-serving skew, logging loss, and regional outages, with detection and mitigation plans.
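One possible way to make the valid / dropped / duplicate / uncertain distinction from the first deliverable concrete is a small decision record. This is a sketch only: the type names, field names, and the rule that only `VALID` is billable immediately are illustrative assumptions, not part of the problem statement.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """The four outcomes named in the deliverables."""
    VALID = "valid"
    DROPPED = "dropped"
    DUPLICATE = "duplicate"
    UNCERTAIN = "uncertain"


@dataclass(frozen=True)
class ImpressionDecision:
    """Hypothetical decision record; all field names are assumptions."""
    impression_id: str
    decision: Decision
    reason: str  # kept so the decision can be audited later

    @property
    def billable_now(self) -> bool:
        # Assumption: only VALID impressions are counted immediately.
        return self.decision is Decision.VALID

    @property
    def needs_reconciliation(self) -> bool:
        # UNCERTAIN impressions are held for later re-evaluation.
        return self.decision is Decision.UNCERTAIN
```

Keeping the record immutable (`frozen=True`) and carrying a `reason` is one way to support the auditable lineage that billing requires; a real answer would define the exact qualification rules for each outcome.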
Constraints
- Real-time decisioning must not materially slow ad delivery; p99 budget is 50ms including network overhead.
- Billing and advertiser reporting require auditable event lineage and idempotent processing.
- Some features are only available asynchronously; the online model must degrade gracefully.
- Privacy and compliance constraints limit raw user-level retention and require regional data handling.
- The system should prefer availability over perfect precision during transient failures, but must clearly mark uncertain impressions for later reconciliation.
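The idempotency, graceful-degradation, and uncertainty-marking constraints above can be sketched together as follows. This is a minimal in-memory illustration under stated assumptions: `ImpressionDecider`, `hypothetical_scorer`, the 0.5 threshold, and the dict used in place of a durable dedup store are all hypothetical and not taken from the problem statement.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def hypothetical_scorer(features: dict) -> Optional[float]:
    """Stand-in risk model: returns None when the risk feature is missing."""
    if features.get("fail"):
        raise TimeoutError("simulated scorer outage")
    return features.get("risk")


class ImpressionDecider:
    def __init__(
        self,
        score_fn: Callable[[dict], Optional[float]],
        threshold: float = 0.5,  # assumed cutoff, for illustration only
    ) -> None:
        # Assumption: a dict stands in for a durable, replicated dedup store.
        self._decided: Dict[str, str] = {}
        self._score_fn = score_fn
        self._threshold = threshold

    def decide(self, impression_id: str, features: dict) -> str:
        # Idempotency: a replayed event returns the first decision unchanged.
        if impression_id in self._decided:
            return self._decided[impression_id]
        try:
            risk = self._score_fn(features)
        except (TimeoutError, ConnectionError):
            risk = None  # scorer unavailable: degrade instead of failing
        if risk is None:
            outcome = "uncertain"  # marked for later reconciliation
        elif risk >= self._threshold:
            outcome = "dropped"
        else:
            outcome = "valid"
        self._decided[impression_id] = outcome
        return outcome

    def counts(self) -> Dict[str, int]:
        # Counts derive from the store, so replays never inflate them.
        return dict(Counter(self._decided.values()))
```

For example, calling `decide("a", {"risk": 0.1})` twice records a single `valid` impression, while `decide("c", {"fail": True})` yields `uncertain` rather than an error, which favors availability and leaves the final call to a later reconciliation pass.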