Product Context
Meta runs a large internal microservices platform where thousands of services consume secrets from a centralized secrets management system. Design an ML-driven control plane that prioritizes secret rotation, detects risky access patterns, and decides when to trigger immediate rotation versus scheduled rotation with minimal service disruption.
Scale
| Signal | Value |
|---|---|
| Internal services | 120,000 |
| Secrets under management | 1.8B active secrets |
| Secret reads/day | 45B |
| Secret write / rotation events/day | 220M |
| Peak policy-evaluation QPS | 650K |
| Peak rotation-decision QPS | 90K |
| End-to-end online latency budget | 120ms p99 |
| Regions | 12 global regions |
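
For orientation, the daily totals in the table can be converted into average rates. The sketch below is back-of-envelope arithmetic only: the figures come from the table, while the constant and function names are illustrative, and a uniform spread of traffic over the day is an assumption that real traffic will not satisfy.

```python
# Back-of-envelope rates derived from the Scale table.
# Names are illustrative; uniform traffic over the day is an assumption.
SECONDS_PER_DAY = 86_400

INTERNAL_SERVICES = 120_000
ACTIVE_SECRETS = 1_800_000_000
READS_PER_DAY = 45_000_000_000
WRITE_ROTATION_EVENTS_PER_DAY = 220_000_000


def derive_rates() -> dict:
    """Return average rates implied by the daily totals."""
    return {
        # Average secret reads per second across all regions.
        "avg_read_qps": READS_PER_DAY / SECONDS_PER_DAY,
        # Average write / rotation events per second.
        "avg_write_rotation_qps": WRITE_ROTATION_EVENTS_PER_DAY / SECONDS_PER_DAY,
        # Mean number of reads each secret receives per day.
        "reads_per_secret_per_day": READS_PER_DAY / ACTIVE_SECRETS,
        # Mean number of secrets owned per service.
        "secrets_per_service": ACTIVE_SECRETS / INTERNAL_SERVICES,
    }


if __name__ == "__main__":
    for name, value in derive_rates().items():
        print(f"{name}: {value:,.1f}")
```

The averages come out to roughly 521K reads/s, about 2.5K write or rotation events/s, 25 reads per secret per day, and 15,000 secrets per service. These are means, so any skew across secrets or services is not captured.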
Task
- Clarify the product goals, threat model, and success metrics for an ML-based secret rotation system.
- Design the end-to-end architecture, including candidate generation, risk ranking, and final re-ranking / policy enforcement before rotation.
- Choose models and features for each stage, and explain what runs online versus batch.
- Define the data pipeline, labels, training cadence, and how to avoid training-serving skew.
- Propose offline and online evaluation, including security guardrails and rollout strategy.
- Identify key failure modes, especially around feature drift, false positives, stale features, and regional outages.
Constraints
- The system must integrate with Meta-style internal service identity, audit logging, and regional control planes.
- Some secrets back critical production paths and cannot be rotated synchronously without a fallback credential path.
- Access logs are high-volume and partially delayed across regions by up to 3 minutes.
- Security decisions must be explainable to infra and service owners.
- Cost matters: only the highest-risk 0.5% of secrets can trigger expensive immediate rotation workflows each day.
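
Two of these constraints can be made concrete with a small sketch: the 0.5% daily cap, and the rule that critical secrets need a fallback credential path before synchronous rotation. Everything named here (`ScoredSecret`, `daily_immediate_budget`, `select_immediate`, the `risk_score` field) is hypothetical, and the in-memory top-k is for illustration only; it is not a proposal for how selection would run at 1.8B secrets.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List

ACTIVE_SECRETS = 1_800_000_000
IMMEDIATE_PER_MILLE = 5  # 0.5% expressed per thousand, to keep integer math


@dataclass(frozen=True)
class ScoredSecret:
    """Hypothetical record: one secret with a model-assigned risk score."""
    secret_id: str
    risk_score: float
    is_critical: bool
    has_fallback_path: bool


def daily_immediate_budget(active_secrets: int = ACTIVE_SECRETS) -> int:
    """Maximum number of immediate rotations allowed per day (0.5% cap)."""
    return active_secrets * IMMEDIATE_PER_MILLE // 1000


def can_rotate_immediately(secret: ScoredSecret) -> bool:
    """Critical secrets qualify only if a fallback credential path exists."""
    return (not secret.is_critical) or secret.has_fallback_path


def select_immediate(scored: Iterable[ScoredSecret], budget: int) -> List[ScoredSecret]:
    """Pick the highest-risk eligible secrets, up to the budget."""
    eligible = (s for s in scored if can_rotate_immediately(s))
    return heapq.nlargest(budget, eligible, key=lambda s: s.risk_score)


if __name__ == "__main__":
    print(daily_immediate_budget())
    sample = [
        ScoredSecret("a", 0.9, is_critical=True, has_fallback_path=False),
        ScoredSecret("b", 0.8, is_critical=False, has_fallback_path=False),
        ScoredSecret("c", 0.7, is_critical=True, has_fallback_path=True),
        ScoredSecret("d", 0.1, is_critical=False, has_fallback_path=False),
    ]
    print([s.secret_id for s in select_immediate(sample, budget=2)])
```

With 1.8B active secrets the cap works out to 9,000,000 immediate rotations per day. In the sample, secret `a` has the highest score but is skipped because it is critical and has no fallback path; what happens to such secrets (for example, falling back to scheduled rotation) is left open here.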