Project Background
At Meta, a critical remote-code-execution vulnerability has been confirmed in a shared internal library used by 500 production services across Facebook, Instagram, Messenger, and Ads systems. The vulnerable library is embedded in both user-facing and backend workloads, and Security has assessed active exploitation as plausible within days if details leak internally or externally.
You are the Security Engineer acting as the execution lead for the remediation program. You have one central security response team of 8 engineers, but service ownership is distributed across roughly 120 engineering teams. The expectation is to drive a coordinated rollout using Meta’s internal source control, CI/CD, and service management tooling, while minimizing user impact and avoiding broad production instability.
Key Stakeholders
Security Engineering wants the fastest possible remediation and temporary mitigations on any service that cannot patch immediately. Infra/SRE wants rollout safety and rollback readiness because many of these services are latency-sensitive. Product engineering teams want to avoid interrupting planned launches, especially for high-priority Ads and Reels initiatives. Legal and Communications want strict need-to-know handling until exposure is reduced.
Constraints
- You have 10 calendar days to reduce the share of vulnerable production services from 100% (all 500 services) to under 5% (fewer than 25 services).
- Only 8 dedicated security engineers are available full-time; no new headcount is approved.
- About 70 services are Tier-0/Tier-1 and cannot tolerate more than 15 minutes of degraded availability.
- 90 services have custom forks or pinned dependencies that prevent an automatic version bump.
- The emergency remediation budget is $350K, mainly for temporary staffing, extended on-call coverage, and validation tooling.
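
The constraints above imply some concrete numbers that are worth working out before planning. The following is a minimal Python sketch of that arithmetic, not part of the original brief. The function name and its output keys are hypothetical. It assumes that every service without a custom fork or pinned dependency can take the automatic version bump, and that only patched services count toward the target; the brief does not say whether temporary mitigations count.

```python
def remediation_targets(total=500, target_percent=5, days=10,
                        forked=90, engineers=8, teams=120):
    """Derive planning numbers from the stated constraints (illustrative only)."""
    # Largest service count that is still strictly under target_percent of the fleet.
    max_remaining = (total * target_percent - 1) // 100
    min_remediated = total - max_remaining

    # Assumption: every service without a fork or pin can be bumped automatically.
    auto_bumpable = total - forked
    # Forked or pinned services that must be fixed even if every automatic bump succeeds.
    min_forked_fixed = max(0, min_remediated - auto_bumpable)

    return {
        "max_remaining": max_remaining,
        "min_remediated": min_remediated,
        "avg_per_day": min_remediated / days,
        "auto_bumpable": auto_bumpable,
        "min_forked_fixed": min_forked_fixed,
        "teams_per_engineer": teams / engineers,
    }


if __name__ == "__main__":
    for name, value in remediation_targets().items():
        print(f"{name}: {value}")
```

With the figures from the brief, at most 24 services may remain vulnerable, so at least 476 must be remediated, an average of about 48 per day. Even if all 410 services without forks or pins are bumped automatically, at least 66 of the 90 forked or pinned services still have to be fixed, and each security engineer would be coordinating roughly 15 owning teams.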
Complications
- A patch exists, but it changes the behavior of one API and may break backward compatibility in some call paths.
- Two major product launches are scheduled in the next week, and their engineering leads are resisting freeze requests.
- Asset inventory is incomplete: Security is only 85% confident the current dependency graph captures all affected services.
Your Task
- Build a rollout strategy that prioritizes services, owners, and sequencing.
- Define the governance model, communication plan, and decision-making cadence.
- Propose validation, canary, rollback, and exception-handling mechanisms.
- Explain how you would handle incomplete inventory and teams that miss deadlines.
- Define success metrics for remediation progress and post-rollout stability.