Product Context
DevFlow provides a public API used by third-party developers for payments, messaging, and analytics. The company wants to replace static per-key quotas with an ML-driven rate-limiting system that predicts abuse risk and allocates request budgets dynamically while protecting legitimate traffic.
Scale
| Signal | Value |
|---|---|
| API consumers | 8M registered API keys |
| Daily active API keys | 1.2M |
| Peak request rate | 450K QPS |
| Regions | 3 active-active regions |
| Historical request logs | 25B requests / day |
| Feature freshness target | < 60 seconds for behavioral features |
| Decision latency budget | 15ms p99 added latency |
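To make the feature-freshness target concrete, here is a minimal sketch of one behavioral feature: per-key request count over a trailing 60-second window. The class name and in-memory layout are illustrative assumptions; a production system would back this with a streaming store, but the windowing logic is the same.

```python
from collections import deque


class SlidingWindowCounter:
    """Per-key request count over a trailing window (e.g. 60s).

    A stand-in for one behavioral feature meeting the < 60s freshness
    target; production would use a streaming feature store instead.
    """

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def record(self, key: str, now: float) -> None:
        # Append the event timestamp for this API key.
        self.events.setdefault(key, deque()).append(now)

    def count(self, key: str, now: float) -> int:
        # Evict timestamps that have aged out of the window, then count.
        q = self.events.get(key)
        if q is None:
            return 0
        cutoff = now - self.window
        while q and q[0] <= cutoff:
            q.popleft()
        return len(q)
```

Because eviction happens lazily on read, `count` is O(evicted events) amortized, which keeps the hot path cheap under the 15ms p99 budget.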
Task
Design an end-to-end ML system for adaptive rate limiting. Your design should address:
- How to define the prediction target and business objective, balancing abuse prevention, fairness to legitimate developers, and minimal false throttling.
- The online serving architecture for making per-request allow / throttle / block decisions under a strict latency budget.
- A multi-stage decision pipeline (fast retrieval / coarse filtering → ML risk scoring → policy re-ranking or final action selection) and the models used at each stage.
- The offline and streaming data pipelines, including labels, delayed feedback, feature computation, and training cadence.
- Evaluation strategy: offline metrics, online experiments, and operational guardrails.
- Failure modes such as feature drift, training-serving skew, adversarial adaptation, and regional outages.
Constraints
- The system must never fully depend on ML; a deterministic fallback limiter is required.
- False positives are expensive: accidentally throttling top enterprise customers can cause revenue loss.
- Some abuse labels arrive hours or days later from fraud investigations, chargebacks, or downstream incident reports.
- The service must support tenant-specific policies, burst allowances, and compliance requirements for auditability.
- Cost matters: the incremental ML decisioning layer should stay below $0.00005 per request on average.
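The deterministic fallback required by the first constraint is typically a token bucket, which also expresses the burst-allowance requirement directly. A minimal sketch, with rate and burst as per-tenant configuration assumptions:

```python
class TokenBucket:
    """Deterministic fallback limiter: used whenever the ML decisioning
    layer is degraded or unavailable, so the system never fully depends
    on ML. rate = steady-state quota; burst = bucket capacity.
    """

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start full: fresh keys get their burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Per-tenant `rate`/`burst` values can be audited as plain configuration, which helps with the compliance requirement; the ML layer then only adjusts these budgets rather than replacing the limiter.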