Context
NovaFlow provides an API that summarizes support tickets and drafts replies using a third-party LLM provider. The product team needs a rate-limiting and admission-control service in front of the provider to prevent quota overruns, control spend, and degrade gracefully during provider incidents.
Constraints
- Traffic: 1,200 requests/second at peak, 250 requests/second on average
- p95 added latency from the rate-limiting layer: under 40ms
- Monthly LLM spend ceiling: $180K
- Hard provider quotas: 25K requests/minute and 18M input tokens/minute across all tenants
- Hallucination ceiling: under 2% on a golden set for requests that are admitted and answered
- Must resist prompt injection attempts that try to bypass policy or force expensive tool/model paths
- Must support tenant-level fairness: no single enterprise customer can consume more than 20% of minute-level capacity during contention (the numeric limits above are collected in the config sketch after this list)
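For reference, the numeric constraints can be pinned in one place and shared by the admission, routing, and monitoring components. A minimal sketch, assuming a Python service; the `AdmissionConfig` name and field names are ours, not part of the brief:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdmissionConfig:
    """Numeric constraints from the brief, in one immutable object."""
    peak_rps: int = 1_200                    # peak requests/second
    avg_rps: int = 250                       # average requests/second
    p95_added_latency_ms: int = 40           # latency budget for this layer
    monthly_spend_ceiling_usd: int = 180_000
    provider_req_per_min: int = 25_000       # hard provider quota, all tenants
    provider_input_tokens_per_min: int = 18_000_000
    hallucination_ceiling: float = 0.02      # on the golden set, admitted requests
    tenant_fair_share: float = 0.20          # max minute-level capacity per tenant
```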
Available Resources
- Per-request metadata available before the LLM call (typed out in the sketch after this list): tenant_id, user_id, endpoint, prompt template version, estimated input tokens, priority tier, and safety classifier score
- Historical logs with actual token usage, latency, cache hit rate, model chosen, and user feedback
- Two provider models: a cheaper, faster model and a slower, higher-quality model
- Redis, Kafka, Postgres, and a feature flag system are available
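The pre-call metadata is the only trusted input the admission layer has, so it is worth giving it an explicit shape. A sketch of one possible type; the field names mirror the list above, but the exact types and encodings are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestMetadata:
    """Trusted, server-derived fields available before any LLM call."""
    tenant_id: str
    user_id: str
    endpoint: str
    prompt_template_version: str
    estimated_input_tokens: int
    priority_tier: int    # e.g. 0 = batch, 1 = standard, 2 = interactive (assumed encoding)
    safety_score: float   # safety classifier output, assumed in [0, 1]
```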
Task
- Design a rate-limiting and budget-enforcement service for LLM API calls, including request-level token estimation, tenant fairness, retries, and fallback behavior (a starting admission sketch follows this list).
- Define how you would evaluate the system offline and online before launch, including cost control, latency, admission accuracy, and quality impact from throttling or model downgrades.
- Propose the prompting and policy layer that prevents user prompts or retrieved content from overriding routing, budget, or safety rules (see the routing sketch below).
- Estimate cost and latency under peak load, and explain how the system behaves during provider quota exhaustion or partial outages (back-of-envelope arithmetic appears below).
- Identify failure modes, monitoring, and mitigations for hallucination, prompt injection, token-estimation errors, and unfair capacity allocation.
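One way to enforce the provider quotas and the tenant fair-share cap is a fixed one-minute window in Redis, checked atomically with a Lua script so that the request count, token count, and per-tenant count are debited together. This is a sketch under assumptions: a redis-py client, key names of our choosing, and no settlement step that reconciles estimated tokens against actual usage after the call returns:

```python
import time

import redis

ADMIT_LUA = """
local req_key, tok_key, tenant_key = KEYS[1], KEYS[2], KEYS[3]
local req_limit    = tonumber(ARGV[1])
local tok_limit    = tonumber(ARGV[2])
local tenant_limit = tonumber(ARGV[3])
local est_tokens   = tonumber(ARGV[4])
local reqs        = tonumber(redis.call('GET', req_key) or '0')
local toks        = tonumber(redis.call('GET', tok_key) or '0')
local tenant_reqs = tonumber(redis.call('GET', tenant_key) or '0')
if reqs + 1 > req_limit
   or toks + est_tokens > tok_limit
   or tenant_reqs + 1 > tenant_limit then
  return 0  -- reject: some budget in this window is exhausted
end
redis.call('INCR', req_key)
redis.call('INCRBY', tok_key, est_tokens)
redis.call('INCR', tenant_key)
redis.call('EXPIRE', req_key, 120)
redis.call('EXPIRE', tok_key, 120)
redis.call('EXPIRE', tenant_key, 120)
return 1
"""

r = redis.Redis()
admit = r.register_script(ADMIT_LUA)

def try_admit(tenant_id: str, estimated_input_tokens: int) -> bool:
    window = int(time.time() // 60)  # current one-minute window
    keys = [
        f"rl:req:{window}",                 # global requests this minute
        f"rl:tok:{window}",                 # global input tokens this minute
        f"rl:tenant:{tenant_id}:{window}",  # this tenant's requests this minute
    ]
    # 25K req/min and 18M input tokens/min are the hard provider quotas;
    # 5,000 = the 20% fair-share cap. The brief applies the cap only during
    # contention; this sketch enforces it unconditionally for brevity.
    return admit(keys=keys, args=[25_000, 18_000_000, 5_000,
                                  estimated_input_tokens]) == 1
```

Rejected requests would then fall to the retry/fallback path (cheaper model, cached answer, queueing via Kafka, or an explicit 429), which is the part of the design the task asks you to fill in.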
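For the policy layer, the core rule is that routing, budget, and safety decisions are computed only from trusted server-side state, never from user text or retrieved documents, so injected instructions have nothing to override. A sketch reusing the `RequestMetadata` type from Available Resources; the model names and every threshold here are assumptions:

```python
CHEAP_MODEL = "fast-model"        # illustrative provider model identifiers
QUALITY_MODEL = "quality-model"

def choose_model(meta: RequestMetadata, budget_remaining_usd: float) -> str:
    # The user prompt and any retrieved content are deliberately not
    # parameters: policy sees only trusted metadata and budget state.
    if meta.safety_score < 0.5:               # assumed block threshold
        raise PermissionError("blocked by safety classifier")
    if budget_remaining_usd <= 0.0:
        return CHEAP_MODEL                    # spend ceiling reached: downgrade
    if meta.priority_tier == 0:               # assumed: batch traffic stays cheap
        return CHEAP_MODEL
    if meta.estimated_input_tokens > 4_000:   # assumed cost-control cutoff
        return CHEAP_MODEL
    return QUALITY_MODEL
```

The prompt itself is then assembled with the policy and routing decision fixed in advance, and user or retrieved content placed in clearly delimited data slots that the system instructions tell the model to treat as untrusted.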
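Finally, the constraints already pin several of the numbers the estimation task asks for. A quick back-of-envelope pass, assuming only a 30-day month:

```python
peak_rpm = 1_200 * 60                         # 72,000 requests/min at peak
provider_rpm = 25_000                         # hard provider quota
max_admit_fraction = provider_rpm / peak_rpm  # ~0.35: at peak, ~65% of traffic
                                              # must be shed, queued, or served
                                              # from cache

avg_monthly_requests = 250 * 60 * 60 * 24 * 30       # ~648M requests/month
per_request_budget = 180_000 / avg_monthly_requests  # ~$0.00028 per request

# If the request quota is fully used, the token quota allows on average:
tokens_per_request_at_quota = 18_000_000 / 25_000    # 720 input tokens/request

print(max_admit_fraction, per_request_budget, tokens_per_request_at_quota)
```

These numbers shape the design directly: at peak the service cannot simply proxy traffic, and the tight per-request budget puts heavy pressure on cache hit rate and on how much traffic can go to the higher-quality model.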