Product Context
Anthropic is reviewing the API design for a high-traffic Claude inference service used both by developers on the Anthropic API and by internal surfaces such as the Claude app. The service must route requests across multiple model variants, enforce policy checks, and return low-latency responses while supporting rapid model iteration.
Scale
| Signal | Value |
|---|---|
| DAU | 25M across API customers + Claude app |
| Peak request QPS | 180K requests/sec |
| Streaming response share | 70% of requests |
| Supported model variants | 12 active Claude SKUs |
| Prompt + context tokens/day | 45T tokens/day |
| p99 time-to-first-token budget | 900ms |
| p99 end-to-end budget (non-streaming) | 4.5s |
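A quick back-of-envelope pass over these numbers is useful before discussing bottlenecks. The sketch below derives requests/day and average prompt size from the table; the peak-to-average ratio is an assumption, not a given figure.

```python
# Back-of-envelope sizing from the table above. The peak-to-average ratio is an
# assumption (not given in the brief); the other figures come from the table.
SECONDS_PER_DAY = 86_400

peak_qps = 180_000        # requests/sec at peak
tokens_per_day = 45e12    # prompt + context tokens/day

# Assume peak is roughly 2-3x the daily average request rate.
for peak_to_avg in (2, 3):
    avg_qps = peak_qps / peak_to_avg
    requests_per_day = avg_qps * SECONDS_PER_DAY
    tokens_per_request = tokens_per_day / requests_per_day
    print(f"peak/avg={peak_to_avg}: ~{requests_per_day:.1e} req/day, "
          f"~{tokens_per_request:,.0f} prompt+context tokens/request")
```

Under these assumptions the service handles on the order of 5-8 billion requests/day at several thousand prompt+context tokens per request, which frames the cache, routing, and logging discussions below.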
Assume the current API design is roughly: Client -> API Gateway -> Auth/Quota -> Request Router -> Safety/Policy -> Prompt Cache -> Model Selection -> Inference Workers -> Post-processing -> Logging/Billing.
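To reason about where this synchronous path breaks first, it helps to split the 900 ms p99 TTFT budget across the hops. The per-stage split below is purely illustrative (an assumption, not a measured breakdown); only the 900 ms total comes from the table.

```python
# Illustrative split of the 900 ms p99 time-to-first-token budget across the
# synchronous hops in the current design. Per-stage numbers are assumptions.
ttft_budget_ms = {
    "api_gateway": 10,
    "auth_quota": 15,
    "request_router": 10,
    "safety_policy": 50,
    "prompt_cache_lookup": 25,
    "model_selection": 15,
    "inference_queue_wait": 75,
    "prefill_to_first_token": 700,
}
assert sum(ttft_budget_ms.values()) == 900  # must fit the p99 TTFT target
```

Even with a generous split, prefill consumes most of the budget, so every control-plane hop has to stay in the low tens of milliseconds at p99.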
Task
Review this design and explain where you expect it to break first as traffic grows and model/product complexity increases.
- Clarify the functional and non-functional requirements for the inference service.
- Size the system and identify the likely bottlenecks across admission control, routing, feature lookup, model serving, and logging.
- Propose an end-to-end architecture, including any retrieval/routing/ranking stages for model selection and fallback (a minimal routing sketch follows this list).
- Define the online vs batch components: what must happen synchronously per request vs asynchronously.
- Describe how you would evaluate the system offline and online, including quality, latency, cost, and safety metrics.
- Identify the top failure modes, especially feature drift, training-serving skew, hot keys, overload, and degraded dependencies (an admission-control sketch for the hot-key/overload case also follows this list).
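For the model-selection and fallback bullet above, a minimal routing sketch might look like the following. The candidate fields, thresholds, and tier names are illustrative assumptions, not the actual router logic.

```python
# Minimal sketch of tiered model selection with fallback. The candidate fields,
# thresholds, and tier names are illustrative assumptions, not the real router.
from dataclasses import dataclass

@dataclass
class ModelCandidate:
    name: str
    cost_per_1k_tokens: float  # blended serving cost, assumed known per model
    p99_ttft_ms: float         # recent p99 time-to-first-token for this model pool
    healthy: bool              # from the serving pool's health checks

def select_model(task_complexity: float,
                 customer_tier: str,
                 candidates: list[ModelCandidate],
                 ttft_budget_ms: float = 900.0) -> ModelCandidate:
    """Pick a model that meets the latency budget, preferring larger models for
    complex tasks or enterprise traffic; fall back down the list otherwise."""
    # Crude capability proxy: cost ordering stands in for model size/quality.
    prefer_large = task_complexity > 0.7 or customer_tier == "enterprise"
    ordered = sorted(candidates, key=lambda m: m.cost_per_1k_tokens, reverse=prefer_large)
    for model in ordered:
        if model.healthy and model.p99_ttft_ms <= ttft_budget_ms:
            return model
    # Degraded mode: any healthy model beats an outright failure.
    healthy = [m for m in candidates if m.healthy]
    if healthy:
        return min(healthy, key=lambda m: m.cost_per_1k_tokens)
    raise RuntimeError("no healthy model pool available")

candidates = [
    ModelCandidate("big-model", cost_per_1k_tokens=0.030, p99_ttft_ms=850, healthy=True),
    ModelCandidate("small-model", cost_per_1k_tokens=0.004, p99_ttft_ms=300, healthy=True),
]
print(select_model(task_complexity=0.9, customer_tier="pro", candidates=candidates).name)
```

The interesting design questions are where this decision gets its inputs (task-type classifier, per-pool health and latency stats) and how quickly fallback reacts when a pool degrades.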
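For the hot-key and overload failure modes, a per-customer token bucket at admission control is one common mitigation. This is a simplified single-process sketch under assumed limits; a real deployment would need a shared or distributed counter.

```python
# Simplified per-customer token-bucket admission control (single process). A real
# deployment would need a shared/distributed counter; the limits are illustrative.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    refill_rate: float   # admitted requests per second for this customer
    capacity: float      # maximum burst size
    tokens: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so the first burst is admitted

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed, queue, or downgrade instead of hitting the workers

# One bucket per customer id: hot keys get throttled before reaching inference.
buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(refill_rate=50.0, capacity=100.0))
print(buckets["customer-123"].allow())  # True until the burst budget is exhausted
```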
Constraints
- Must support both streaming and non-streaming responses.
- Some enterprise customers require regional data residency and audit logs.
- Prompt caching can reduce cost materially, but cache misses are common for long-tail workloads.
- Router decisions may use customer tier, prompt metadata, historical latency, and task-type classifiers.
- Cost matters: average serving cost must stay under $0.012/request blended across models.
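The $0.012/request target can be sanity-checked against the traffic numbers. In the sketch below, the peak-to-average ratio, cache hit rate, and per-token serving cost are assumptions for illustration only; just the QPS and token volumes come from the table, and decode/output tokens are not included.

```python
# Sanity check of the $0.012/request blended cost target. The peak-to-average
# ratio, cache hit rate, and $/token serving cost are assumptions; only the QPS
# and token volumes come from the table. Output/decode tokens are not included.
peak_qps = 180_000
seconds_per_day = 86_400
tokens_per_day = 45e12

avg_qps = peak_qps / 2.5                   # assumed peak-to-average ratio
requests_per_day = avg_qps * seconds_per_day
avg_tokens_per_request = tokens_per_day / requests_per_day

serving_cost_per_million_tokens = 1.50     # assumed blended serving cost, $/1M tokens
cache_hit_rate = 0.5                       # assumed share of prompt tokens hitting the cache
cache_discount = 0.9                       # assumed cost reduction on cached tokens

effective_tokens = avg_tokens_per_request * (1 - cache_hit_rate * cache_discount)
cost_per_request = effective_tokens * serving_cost_per_million_tokens / 1e6
print(f"~{avg_tokens_per_request:,.0f} prompt+context tokens/request, "
      f"~${cost_per_request:.4f}/request vs the $0.012 budget")
```

The point of the exercise is to show which levers (cache hit rate, routing mix across SKUs, prompt size) dominate blended cost, since long-tail workloads with low cache hit rates can blow through the budget even when the fleet-wide average looks healthy.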