Design ML-Powered Secret Rotation Prioritizer

Hard

ML System Design

Product Context

Meta runs a large internal microservices platform where thousands of services consume secrets from a centralized secrets management system. Design an ML-driven control plane that prioritizes secret rotation, detects risky access patterns, and decides when to trigger immediate rotation versus scheduled rotation with minimal service disruption.

Scale

Signal	Value
Internal services	120,000
Secrets under management	1.8B active secrets
Secret reads/day	45B
Secret write / rotation events/day	220M
Peak policy-evaluation QPS	650K
Peak rotation-decision QPS	90K
End-to-end online latency budget	120ms p99
Regions	12 global regions
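
As a quick sanity check on these numbers, the daily totals can be converted into average per-second rates and per-entity ratios. The snippet below is a back-of-envelope sketch, not part of the original problem statement. The variable names are illustrative, and the final ratio assumes each secret read triggers roughly one policy evaluation, which the prompt does not state.

```python
# Back-of-envelope arithmetic on the scale table (illustrative names).
SECONDS_PER_DAY = 86_400

services = 120_000
active_secrets = 1_800_000_000
reads_per_day = 45_000_000_000
writes_per_day = 220_000_000
peak_policy_qps = 650_000

# Average rates implied by the daily totals.
avg_read_qps = reads_per_day / SECONDS_PER_DAY      # about 520,833 reads/s
avg_write_qps = writes_per_day / SECONDS_PER_DAY    # about 2,546 writes/s

# Per-entity ratios that help size feature stores and candidate sets.
reads_per_secret_per_day = reads_per_day / active_secrets   # 25.0
secrets_per_service = active_secrets / services             # 15000.0

# ASSUMPTION: one policy evaluation per secret read. Under that assumption,
# the stated peak policy QPS is only about 1.25x the average read rate.
peak_to_avg_policy = peak_policy_qps / avg_read_qps

print(f"avg read QPS:        {avg_read_qps:,.0f}")
print(f"avg write QPS:       {avg_write_qps:,.0f}")
print(f"reads/secret/day:    {reads_per_secret_per_day:.1f}")
print(f"secrets/service:     {secrets_per_service:,.0f}")
print(f"peak/avg policy QPS: {peak_to_avg_policy:.3f}")
```

An average of about 25 reads per secret per day suggests most secrets are read rarely, so per-secret access features are likely to be sparse for much of the population.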

Task

Clarify the product goals, threat model, and success metrics for an ML-based secret rotation system.
Design the end-to-end architecture, including candidate generation, risk ranking, and final re-ranking / policy enforcement before rotation.
Choose models and features for each stage, and explain what runs online versus batch.
Define the data pipeline, labels, training cadence, and how to avoid training-serving skew.
Propose offline and online evaluation, including security guardrails and rollout strategy.
Identify key failure modes, especially around feature drift, false positives, stale features, and regional outages.

Constraints

The system must integrate with Meta-style internal service identity, audit logging, and regional control planes.
Some secrets back critical production paths and cannot be rotated synchronously without a fallback credential path.
Access logs are high-volume and partially delayed across regions by up to 3 minutes.
Security decisions must be explainable to infra and service owners.
Cost matters: only the highest-risk 0.5% of secrets can trigger expensive immediate rotation workflows each day.
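
The cost constraint translates into a concrete daily cap: 0.5% of 1.8B active secrets is 9,000,000 immediate rotations per day. The sketch below shows one possible way to enforce that cap with a top-k selection over risk scores. It is a minimal illustration under stated assumptions: `ScoredSecret`, `daily_budget`, and `split_rotations` are hypothetical names, the risk score is assumed to come from an upstream ranking model, and gating immediate rotation on the presence of a fallback credential path is one interpretation of the critical-path constraint, not a requirement stated in the prompt.

```python
import heapq
from typing import Iterable, List, NamedTuple, Tuple


class ScoredSecret(NamedTuple):
    """Hypothetical record produced by an upstream risk-ranking stage."""
    secret_id: str
    risk: float          # higher means riskier
    has_fallback: bool   # True if a fallback credential path exists


def daily_budget(active_secrets: int, per_mille: int = 5) -> int:
    """Immediate-rotation cap; 5 per mille equals the stated 0.5%.

    Integer arithmetic avoids floating-point rounding on large counts.
    """
    return active_secrets * per_mille // 1000


def split_rotations(
    scored: Iterable[ScoredSecret], budget: int
) -> Tuple[List[ScoredSecret], List[ScoredSecret]]:
    """Return (immediate, scheduled) lists under the budget.

    ASSUMPTION: secrets without a fallback path are never rotated
    immediately and always fall through to scheduled rotation.
    """
    scored = list(scored)
    eligible = [s for s in scored if s.has_fallback]
    # Keep only the `budget` highest-risk eligible secrets.
    immediate = heapq.nlargest(budget, eligible, key=lambda s: s.risk)
    immediate_ids = {s.secret_id for s in immediate}
    scheduled = [s for s in scored if s.secret_id not in immediate_ids]
    return immediate, scheduled


if __name__ == "__main__":
    print(daily_budget(1_800_000_000))  # 9000000

    sample = [
        ScoredSecret("a", 0.9, has_fallback=False),
        ScoredSecret("b", 0.8, has_fallback=True),
        ScoredSecret("c", 0.2, has_fallback=True),
    ]
    immediate, scheduled = split_rotations(sample, budget=1)
    print([s.secret_id for s in immediate])   # ['b']
    print([s.secret_id for s in scheduled])   # ['a', 'c']
```

In this sketch the highest-risk secret `a` is deferred because it lacks a fallback path. A real design would probably escalate such cases to the owning team rather than silently scheduling them, which is worth raising when discussing guardrails.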
