Product Context
Meta operates a large fleet of connected devices across homes, offices, and edge deployments. Design an end-to-end ML system that decides which devices should receive an over-the-air firmware update, in what order, and how to minimize bricking risk from power failures while enforcing strong security guarantees.
Scale
| Signal | Value |
|---|---|
| Fleet size | 45M active devices |
| DAU-equivalent check-ins | 38M devices/day |
| Peak update decision QPS | 120K device check-ins/sec during rollout waves |
| Firmware variants | 2,500 active hardware/region/carrier combinations |
| New firmware releases | 40 major/minor releases per month |
| End-to-end decision latency budget | 150ms p99 |
| Daily update candidates | 8M-12M devices |
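
A quick back-of-envelope pass over the table can help size the serving tier. The sketch below uses only the figures above; the constant and function names are illustrative, and it assumes each of the 38M devices checks in at least once per day, so the average rate it computes is a lower bound rather than a measured value.

```python
# Figures copied from the Scale table; names are hypothetical.
FLEET_SIZE = 45_000_000
DAILY_CHECKIN_DEVICES = 38_000_000
PEAK_QPS = 120_000
CANDIDATES_LOW, CANDIDATES_HIGH = 8_000_000, 12_000_000
SECONDS_PER_DAY = 86_400


def scale_summary():
    """Derive rough ratios from the stated scale numbers."""
    # Assumption: one check-in per device per day, so this is a floor.
    avg_qps_floor = DAILY_CHECKIN_DEVICES / SECONDS_PER_DAY
    return {
        "daily_checkin_fraction": DAILY_CHECKIN_DEVICES / FLEET_SIZE,
        "avg_qps_floor": avg_qps_floor,
        "peak_to_avg_ratio": PEAK_QPS / avg_qps_floor,
        "candidate_fraction_low": CANDIDATES_LOW / FLEET_SIZE,
        "candidate_fraction_high": CANDIDATES_HIGH / FLEET_SIZE,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.3f}")
```

Under that assumption, roughly 84% of the fleet checks in daily, the average floor is about 440 check-ins/sec, and the stated peak is about 270 times that floor, which suggests rollout waves rather than organic traffic drive capacity planning.
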
Task
- Clarify the product goals and define functional and non-functional requirements for an ML-assisted OTA system.
- Design a multi-stage decision pipeline for update eligibility, candidate retrieval, risk ranking, and final rollout policy.
- Propose how the system handles interrupted updates, especially power loss during download, install, and reboot.
- Define the security architecture: authenticity, integrity, anti-rollback, key management, and secure recovery.
- Describe the offline and online data pipelines, model training cadence, and how you prevent training-serving skew.
- Define evaluation, monitoring, rollback strategy, and top failure modes at fleet scale.
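
One possible shape for the multi-stage pipeline in the second task is sketched below. Everything in it is an assumption made for illustration: the `Device` fields, the 50% battery gate, the hand-written `risk_score` (a stand-in for a learned model), and the `max_risk` cutoff. Candidate retrieval is folded into the eligibility filter for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical per-device features available at check-in."""
    device_id: str
    on_mains: bool
    battery_pct: int
    free_flash_mb: int
    recent_failures: int


def is_eligible(d, image_mb, min_battery_pct=50):
    # Stage 1: deterministic hard gates, independent of any model.
    has_power = d.on_mains or d.battery_pct >= min_battery_pct
    has_space = d.free_flash_mb >= image_mb
    return has_power and has_space


def risk_score(d):
    # Stage 2 placeholder for a learned model: higher means riskier.
    score = 0.1 * d.recent_failures
    if not d.on_mains:
        score += (100 - d.battery_pct) / 200
    return min(score, 1.0)


def plan_wave(devices, image_mb, wave_size, max_risk=0.5):
    # Stage 3: rollout policy. Lowest-risk devices go first, capped
    # by the wave size; devices above max_risk wait for a later wave.
    eligible = [d for d in devices if is_eligible(d, image_mb)]
    ranked = sorted(eligible, key=risk_score)
    safe = [d for d in ranked if risk_score(d) <= max_risk]
    return [d.device_id for d in safe[:wave_size]]
```

Keeping the hard gates separate from the score means a model regression can reorder devices but cannot push an update to a device that fails the power or storage check.
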
Constraints
- Devices are heterogeneous: battery-powered and mains-powered, with intermittent connectivity and limited flash/storage.
- Some devices check in only a few times per day; others stream telemetry continuously.
- The online service must make rollout decisions within 150ms p99 and remain available during regional outages.
- Regulatory/compliance constraints require signed firmware, auditable rollout decisions, and region-specific update policies.
- Cost matters: avoid shipping large updates to devices unlikely to succeed on the first attempt.
- The system must prefer safe degradation: if ML is unavailable, updates should fall back to deterministic policy rather than block critical security patches.
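
The safe-degradation constraint can be illustrated with a small wrapper that gives the model a time budget and falls back to a deterministic rule on timeout or error. This is a minimal sketch, not a prescribed design: the dictionary keys, the 100 ms model budget, the 50% battery gate, and the choice to defer non-critical updates while the model is unavailable are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for model calls (size is an arbitrary choice).
_POOL = ThreadPoolExecutor(max_workers=4)


def power_ok(device, min_battery_pct=50):
    # Hard safety gate applied on both the ML and fallback paths.
    return device["on_mains"] or device["battery_pct"] >= min_battery_pct


def deterministic_policy(device):
    # Assumed fallback rule: critical patches still go out when the
    # power gate passes; non-critical updates wait for the model.
    return device["critical_patch"] and power_ok(device)


def decide(scorer, device, budget_s=0.1, max_risk=0.5):
    """Return a decision plus its source, for the audit trail."""
    future = _POOL.submit(scorer, device)
    try:
        risk = future.result(timeout=budget_s)
        offer = risk <= max_risk and power_ok(device)
        return {"offer": offer, "source": "ml"}
    except Exception:
        # Timeout or scorer failure: degrade instead of blocking.
        future.cancel()
        return {"offer": deterministic_policy(device), "source": "fallback"}
```

Recording `source` with each decision is one way to keep rollouts auditable, since a reviewer can tell whether the model or the fallback rule produced a given offer.
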