Interview Guides

Design ML-Driven OTA Rollout

Hard

ML System Design

Product Context

Meta operates a large fleet of connected devices across homes, offices, and edge deployments. Design an end-to-end ML system that decides which devices should receive an over-the-air firmware update, in what order, and how to minimize bricking risk from power failures while enforcing strong security guarantees.

Scale

Signal	Value
Fleet size	45M active devices
DAU-equivalent check-ins	38M devices/day
Peak update decision QPS	120K device check-ins/sec during rollout waves
Firmware variants	2,500 active hardware/region/carrier combinations
New firmware releases	40 major/minor releases per month
End-to-end decision latency budget	150ms p99
Daily update candidates	8M-12M devices
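
A quick back-of-envelope pass over the table helps size the serving tier. The sketch below assumes, purely for illustration, that each of the 38M daily devices checks in once and that check-ins are spread evenly over the day; real devices may check in more often, so the computed average is a lower bound. The variable names and the 30-day month are assumptions, not part of the prompt.

```python
# Back-of-envelope sizing from the Scale table.
# Assumption: one check-in per device per day, evenly spread (a lower bound).
SECONDS_PER_DAY = 24 * 60 * 60

fleet_size = 45_000_000
daily_checkin_devices = 38_000_000
peak_qps = 120_000
daily_candidates = (8_000_000, 12_000_000)
releases_per_month = 40

# Average check-in rate if load were perfectly flat.
avg_qps = daily_checkin_devices / SECONDS_PER_DAY

# How much burstier rollout waves are than the flat average.
peak_to_avg = peak_qps / avg_qps

# Share of the fleet that is an update candidate on a given day.
candidate_share = tuple(c / fleet_size for c in daily_candidates)

# Release cadence, assuming a 30-day month.
releases_per_day = releases_per_month / 30

print(f"average check-in rate: {avg_qps:,.0f}/sec")
print(f"peak-to-average ratio: {peak_to_avg:,.0f}x")
print(f"daily candidate share: {candidate_share[0]:.0%} to {candidate_share[1]:.0%}")
print(f"releases per day: {releases_per_day:.2f}")
```

Under these assumptions the peak is a few hundred times the flat average, which suggests capacity is driven by rollout waves rather than steady-state traffic.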

Task

Clarify the product goals and define functional and non-functional requirements for an ML-assisted OTA system.
Design a multi-stage decision pipeline for update eligibility, candidate retrieval, risk ranking, and final rollout policy.
Propose how the system handles interrupted updates, especially power loss during download, install, and reboot.
Define the security architecture: authenticity, integrity, anti-rollback, key management, and secure recovery.
Describe the offline and online data pipelines, model training cadence, and how you prevent training-serving skew.
Define evaluation, monitoring, rollback strategy, and top failure modes at fleet scale.

Constraints

Devices are heterogeneous: battery-powered and mains-powered, with intermittent connectivity and limited flash/storage.
Some devices check in only a few times per day; others stream telemetry continuously.
The online service must make rollout decisions within 150ms p99 and remain available during regional outages.
Regulatory/compliance constraints require signed firmware, auditable rollout decisions, and region-specific update policies.
Cost matters: avoid shipping large updates to devices unlikely to succeed on the first attempt.
The system must prefer safe degradation: if ML is unavailable, updates should fall back to deterministic policy rather than block critical security patches.
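
The safe-degradation constraint can be illustrated with a minimal sketch, assuming the ML risk score arrives as an optional float that is `None` when the model is unavailable. Every name and threshold here (`DeviceState`, `Release`, `MIN_BATTERY_PCT`, `MAX_ML_RISK`, `decide`) is hypothetical and chosen only for illustration; deferring non-critical releases while on the fallback path is one possible policy, not something the prompt specifies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, for illustration only.
MIN_BATTERY_PCT = 50   # minimum charge for battery-powered devices
MAX_ML_RISK = 0.2      # maximum predicted failure risk to offer an update


@dataclass(frozen=True)
class DeviceState:
    on_mains: bool
    battery_pct: int
    free_flash_mb: int


@dataclass(frozen=True)
class Release:
    size_mb: int
    is_critical_security: bool


@dataclass(frozen=True)
class Decision:
    action: str   # "offer" or "defer"
    source: str   # which path produced the decision, kept for auditing


def passes_hard_checks(device: DeviceState, release: Release) -> bool:
    """Deterministic preconditions that apply with or without ML."""
    has_power = device.on_mains or device.battery_pct >= MIN_BATTERY_PCT
    has_space = device.free_flash_mb >= release.size_mb
    return has_power and has_space


def decide(device: DeviceState, release: Release,
           ml_risk: Optional[float]) -> Decision:
    """Return a rollout decision; ml_risk is None when ML is unavailable."""
    if not passes_hard_checks(device, release):
        return Decision("defer", "hard_checks")
    if ml_risk is None:
        # Fallback path: critical security patches are never blocked by
        # an ML outage; other releases wait (an assumed policy).
        action = "offer" if release.is_critical_security else "defer"
        return Decision(action, "deterministic_fallback")
    action = "offer" if ml_risk <= MAX_ML_RISK else "defer"
    return Decision(action, "ml_ranker")


if __name__ == "__main__":
    device = DeviceState(on_mains=True, battery_pct=10, free_flash_mb=500)
    patch = Release(size_mb=200, is_critical_security=True)
    print(decide(device, patch, ml_risk=None))
```

Recording the `source` alongside the action is one way to keep fallback decisions distinguishable in an audit log from decisions made by the model.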
