Product Context
Anthropic is reviewing the API design for a high-traffic Claude inference service used both by developers on the Anthropic API and by internal surfaces such as the Claude app. The service must route requests across multiple model variants, enforce policy checks, and return low-latency responses while supporting rapid model iteration.
Scale
| Signal | Value |
|---|---|
| DAU | 25M across API customers + Claude app |
| Peak request QPS | 180K requests/sec |
| Streaming response share | 70% of requests |
| Supported model variants | 12 active Claude SKUs |
| Prompt + context tokens/day | 45T tokens/day |
| p99 time-to-first-token budget | 900ms |
| p99 end-to-end budget (non-streaming) | 4.5s |
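A quick back-of-envelope pass over these numbers is useful before discussing bottlenecks. The sketch below derives requests/day and average prompt size from the table; the peak-to-average ratio is an assumption, not a given figure.

```python
# Back-of-envelope sizing from the table above. The peak-to-average ratio is an
# assumption (not given in the brief); the other figures come from the table.
SECONDS_PER_DAY = 86_400

peak_qps = 180_000        # requests/sec at peak
tokens_per_day = 45e12    # prompt + context tokens/day

# Assume peak is roughly 2-3x the daily average request rate.
for peak_to_avg in (2, 3):
    avg_qps = peak_qps / peak_to_avg
    requests_per_day = avg_qps * SECONDS_PER_DAY
    tokens_per_request = tokens_per_day / requests_per_day
    print(f"peak/avg={peak_to_avg}: ~{requests_per_day:.1e} req/day, "
          f"~{tokens_per_request:,.0f} prompt+context tokens/request")
```

Under these assumptions the service handles on the order of 5-8 billion requests/day at several thousand prompt+context tokens per request, which frames the cache, routing, and logging discussions below.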
Assume the current API design is roughly: Client -> API Gateway -> Auth/Quota -> Request Router -> Safety/Policy -> Prompt Cache -> Model Selection -> Inference Workers -> Post-processing -> Logging/Billing.
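To reason about where this synchronous path breaks first, it helps to split the 900 ms p99 TTFT budget across the hops. The per-stage split below is purely illustrative (an assumption, not a measured breakdown); only the 900 ms total comes from the table.

```python
# Illustrative split of the 900 ms p99 time-to-first-token budget across the
# synchronous hops in the current design. Per-stage numbers are assumptions.
ttft_budget_ms = {
    "api_gateway": 10,
    "auth_quota": 15,
    "request_router": 10,
    "safety_policy": 50,
    "prompt_cache_lookup": 25,
    "model_selection": 15,
    "inference_queue_wait": 75,
    "prefill_to_first_token": 700,
}
assert sum(ttft_budget_ms.values()) == 900  # must fit the p99 TTFT target
```

Even with a generous split, prefill consumes most of the budget, so every control-plane hop has to stay in the low tens of milliseconds at p99.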
Task
Review this design and explain where you expect it to break first as traffic grows and model/product complexity increases.
- Clarify the functional and non-functional requirements for the inference service.
- Size the system and identify the likely bottlenecks across admission control, routing, feature lookup, model serving, and logging.
- Propose an end-to-end architecture, including any retrieval/routing/ranking stages for model selection and fallback (a minimal routing sketch follows this list).
- Define the online vs batch components: what must happen synchronously per request vs asynchronously.
- Describe how you would evaluate the system offline and online, including quality, latency, cost, and safety metrics.
- Identify the top failure modes, especially feature drift, training-serving skew, hot keys, overload, and degraded dependencies (an admission-control sketch for the hot-key/overload case also follows this list).
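For the model-selection and fallback bullet above, a minimal routing sketch might look like the following. The candidate fields, thresholds, and tier names are illustrative assumptions, not the actual router logic.

```python
# Minimal sketch of tiered model selection with fallback. The candidate fields,
# thresholds, and tier names are illustrative assumptions, not the real router.
from dataclasses import dataclass

@dataclass
class ModelCandidate:
    name: str
    cost_per_1k_tokens: float  # blended serving cost, assumed known per model
    p99_ttft_ms: float         # recent p99 time-to-first-token for this model pool
    healthy: bool              # from the serving pool's health checks

def select_model(task_complexity: float,
                 customer_tier: str,
                 candidates: list[ModelCandidate],
                 ttft_budget_ms: float = 900.0) -> ModelCandidate:
    """Pick a model that meets the latency budget, preferring larger models for
    complex tasks or enterprise traffic; fall back down the list otherwise."""
    # Crude capability proxy: cost ordering stands in for model size/quality.
    prefer_large = task_complexity > 0.7 or customer_tier == "enterprise"
    ordered = sorted(candidates, key=lambda m: m.cost_per_1k_tokens, reverse=prefer_large)
    for model in ordered:
        if model.healthy and model.p99_ttft_ms <= ttft_budget_ms:
            return model
    # Degraded mode: any healthy model beats an outright failure.
    healthy = [m for m in candidates if m.healthy]
    if healthy:
        return min(healthy, key=lambda m: m.cost_per_1k_tokens)
    raise RuntimeError("no healthy model pool available")

candidates = [
    ModelCandidate("big-model", cost_per_1k_tokens=0.030, p99_ttft_ms=850, healthy=True),
    ModelCandidate("small-model", cost_per_1k_tokens=0.004, p99_ttft_ms=300, healthy=True),
]
print(select_model(task_complexity=0.9, customer_tier="pro", candidates=candidates).name)
```

The interesting design questions are where this decision gets its inputs (task-type classifier, per-pool health and latency stats) and how quickly fallback reacts when a pool degrades.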
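For the hot-key and overload failure modes, a per-customer token bucket at admission control is one common mitigation. This is a simplified single-process sketch under assumed limits; a real deployment would need a shared or distributed counter.

```python
# Simplified per-customer token-bucket admission control (single process). A real
# deployment would need a shared/distributed counter; the limits are illustrative.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    refill_rate: float   # admitted requests per second for this customer
    capacity: float      # maximum burst size
    tokens: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so the first burst is admitted

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed, queue, or downgrade instead of hitting the workers

# One bucket per customer id: hot keys get throttled before reaching inference.
buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(refill_rate=50.0, capacity=100.0))
print(buckets["customer-123"].allow())  # True until the burst budget is exhausted
```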
Constraints
- Must support both streaming and non-streaming responses.
- Some enterprise customers require regional data residency and audit logs.
- Prompt caching can reduce cost materially, but cache misses are common for long-tail workloads.
- Router decisions may use customer tier, prompt metadata, historical latency, and task-type classifiers.
- Cost matters: average serving cost must stay under $0.012/request blended across models.
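The $0.012/request target can be sanity-checked against the traffic numbers. In the sketch below, the peak-to-average ratio, cache hit rate, and per-token serving cost are assumptions for illustration only; just the QPS and token volumes come from the table, and decode/output tokens are not included.

```python
# Sanity check of the $0.012/request blended cost target. The peak-to-average
# ratio, cache hit rate, and $/token serving cost are assumptions; only the QPS
# and token volumes come from the table. Output/decode tokens are not included.
peak_qps = 180_000
seconds_per_day = 86_400
tokens_per_day = 45e12

avg_qps = peak_qps / 2.5                   # assumed peak-to-average ratio
requests_per_day = avg_qps * seconds_per_day
avg_tokens_per_request = tokens_per_day / requests_per_day

serving_cost_per_million_tokens = 1.50     # assumed blended serving cost, $/1M tokens
cache_hit_rate = 0.5                       # assumed share of prompt tokens hitting the cache
cache_discount = 0.9                       # assumed cost reduction on cached tokens

effective_tokens = avg_tokens_per_request * (1 - cache_hit_rate * cache_discount)
cost_per_request = effective_tokens * serving_cost_per_million_tokens / 1e6
print(f"~{avg_tokens_per_request:,.0f} prompt+context tokens/request, "
      f"~${cost_per_request:.4f}/request vs the $0.012 budget")
```

The point of the exercise is to show which levers (cache hit rate, routing mix across SKUs, prompt size) dominate blended cost, since long-tail workloads with low cache hit rates can blow through the budget even when the fleet-wide average looks healthy.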