Context
NovaFlow provides an API that summarizes support tickets and drafts replies using a third-party LLM provider. The product team needs a rate-limiting and admission-control service in front of the provider to prevent quota overruns, control spend, and degrade gracefully during provider incidents.
Constraints
- Traffic: 1,200 requests/second at peak, 250 requests/second on average
- p95 added latency from the rate-limiting layer: under 40ms
- Monthly LLM spend ceiling: $180K
- Hard provider quotas: 25K requests/minute and 18M input tokens/minute across all tenants
- Hallucination ceiling: under 2% on a golden set for requests that are admitted and answered
- Must resist prompt injection attempts that try to bypass policy or force expensive tool/model paths
- Must support tenant-level fairness: no single enterprise customer can consume more than 20% of minute-level capacity during contention (the numeric limits above are collected in the config sketch after this list)
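For reference, the numeric constraints can be pinned in one place and shared by the admission, routing, and monitoring components. A minimal sketch, assuming a Python service; the `AdmissionConfig` name and field names are ours, not part of the brief:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdmissionConfig:
    """Numeric constraints from the brief, in one immutable object."""
    peak_rps: int = 1_200                    # peak requests/second
    avg_rps: int = 250                       # average requests/second
    p95_added_latency_ms: int = 40           # latency budget for this layer
    monthly_spend_ceiling_usd: int = 180_000
    provider_req_per_min: int = 25_000       # hard provider quota, all tenants
    provider_input_tokens_per_min: int = 18_000_000
    hallucination_ceiling: float = 0.02      # on the golden set, admitted requests
    tenant_fair_share: float = 0.20          # max minute-level capacity per tenant
```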
Available Resources
- Per-request metadata available before the LLM call (typed out in the sketch after this list): tenant_id, user_id, endpoint, prompt template version, estimated input tokens, priority tier, and safety classifier score
- Historical logs with actual token usage, latency, cache hit rate, model chosen, and user feedback
- Two provider models: a cheaper, faster model and a slower, higher-quality model
- Redis, Kafka, Postgres, and a feature flag system are available
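The pre-call metadata is the only trusted input the admission layer has, so it is worth giving it an explicit shape. A sketch of one possible type; the field names mirror the list above, but the exact types and encodings are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestMetadata:
    """Trusted, server-derived fields available before any LLM call."""
    tenant_id: str
    user_id: str
    endpoint: str
    prompt_template_version: str
    estimated_input_tokens: int
    priority_tier: int    # e.g. 0 = batch, 1 = standard, 2 = interactive (assumed encoding)
    safety_score: float   # safety classifier output, assumed in [0, 1]
```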
Task
- Design a rate-limiting and budget-enforcement service for LLM API calls, including request-level token estimation, tenant fairness, retries, and fallback behavior (a starting admission sketch follows this list).
- Define how you would evaluate the system offline and online before launch, including cost control, latency, admission accuracy, and quality impact from throttling or model downgrades.
- Propose the prompting and policy layer that prevents user prompts or retrieved content from overriding routing, budget, or safety rules (see the routing sketch below).
- Estimate cost and latency under peak load, and explain how the system behaves during provider quota exhaustion or partial outages (back-of-envelope arithmetic appears below).
- Identify failure modes, monitoring, and mitigations for hallucination, prompt injection, token-estimation errors, and unfair capacity allocation.
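One way to enforce the provider quotas and the tenant fair-share cap is a fixed one-minute window in Redis, checked atomically with a Lua script so that the request count, token count, and per-tenant count are debited together. This is a sketch under assumptions: a redis-py client, key names of our choosing, and no settlement step that reconciles estimated tokens against actual usage after the call returns:

```python
import time

import redis

ADMIT_LUA = """
local req_key, tok_key, tenant_key = KEYS[1], KEYS[2], KEYS[3]
local req_limit    = tonumber(ARGV[1])
local tok_limit    = tonumber(ARGV[2])
local tenant_limit = tonumber(ARGV[3])
local est_tokens   = tonumber(ARGV[4])
local reqs        = tonumber(redis.call('GET', req_key) or '0')
local toks        = tonumber(redis.call('GET', tok_key) or '0')
local tenant_reqs = tonumber(redis.call('GET', tenant_key) or '0')
if reqs + 1 > req_limit
   or toks + est_tokens > tok_limit
   or tenant_reqs + 1 > tenant_limit then
  return 0  -- reject: some budget in this window is exhausted
end
redis.call('INCR', req_key)
redis.call('INCRBY', tok_key, est_tokens)
redis.call('INCR', tenant_key)
redis.call('EXPIRE', req_key, 120)
redis.call('EXPIRE', tok_key, 120)
redis.call('EXPIRE', tenant_key, 120)
return 1
"""

r = redis.Redis()
admit = r.register_script(ADMIT_LUA)

def try_admit(tenant_id: str, estimated_input_tokens: int) -> bool:
    window = int(time.time() // 60)  # current one-minute window
    keys = [
        f"rl:req:{window}",                 # global requests this minute
        f"rl:tok:{window}",                 # global input tokens this minute
        f"rl:tenant:{tenant_id}:{window}",  # this tenant's requests this minute
    ]
    # 25K req/min and 18M input tokens/min are the hard provider quotas;
    # 5,000 = the 20% fair-share cap. The brief applies the cap only during
    # contention; this sketch enforces it unconditionally for brevity.
    return admit(keys=keys, args=[25_000, 18_000_000, 5_000,
                                  estimated_input_tokens]) == 1
```

Rejected requests would then fall to the retry/fallback path (cheaper model, cached answer, queueing via Kafka, or an explicit 429), which is the part of the design the task asks you to fill in.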
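For the policy layer, the core rule is that routing, budget, and safety decisions are computed only from trusted server-side state, never from user text or retrieved documents, so injected instructions have nothing to override. A sketch reusing the `RequestMetadata` type from Available Resources; the model names and every threshold here are assumptions:

```python
CHEAP_MODEL = "fast-model"        # illustrative provider model identifiers
QUALITY_MODEL = "quality-model"

def choose_model(meta: RequestMetadata, budget_remaining_usd: float) -> str:
    # The user prompt and any retrieved content are deliberately not
    # parameters: policy sees only trusted metadata and budget state.
    if meta.safety_score < 0.5:               # assumed block threshold
        raise PermissionError("blocked by safety classifier")
    if budget_remaining_usd <= 0.0:
        return CHEAP_MODEL                    # spend ceiling reached: downgrade
    if meta.priority_tier == 0:               # assumed: batch traffic stays cheap
        return CHEAP_MODEL
    if meta.estimated_input_tokens > 4_000:   # assumed cost-control cutoff
        return CHEAP_MODEL
    return QUALITY_MODEL
```

The prompt itself is then assembled with the policy and routing decision fixed in advance, and user or retrieved content placed in clearly delimited data slots that the system instructions tell the model to treat as untrusted.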
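Finally, the constraints already pin several of the numbers the estimation task asks for. A quick back-of-envelope pass, assuming only a 30-day month:

```python
peak_rpm = 1_200 * 60                         # 72,000 requests/min at peak
provider_rpm = 25_000                         # hard provider quota
max_admit_fraction = provider_rpm / peak_rpm  # ~0.35: at peak, ~65% of traffic
                                              # must be shed, queued, or served
                                              # from cache

avg_monthly_requests = 250 * 60 * 60 * 24 * 30       # ~648M requests/month
per_request_budget = 180_000 / avg_monthly_requests  # ~$0.00028 per request

# If the request quota is fully used, the token quota allows on average:
tokens_per_request_at_quota = 18_000_000 / 25_000    # 720 input tokens/request

print(max_admit_fraction, per_request_budget, tokens_per_request_at_quota)
```

These numbers shape the design directly: at peak the service cannot simply proxy traffic, and the tight per-request budget puts heavy pressure on cache hit rate and on how much traffic can go to the higher-quality model.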