Build a Cost-and-Latency Aware LLM Router

Medium

Coding

You are given a stream of API requests, where each request includes an estimated prompt token count, expected completion token count, latency SLO in milliseconds, and a quality tier (standard or premium). Write a function that routes each request to one of several model endpoints with different prices, throughput limits, and p95 latencies, so that total cost is minimized while meeting latency and quality constraints whenever possible.

The function should also return a short per-request explanation string suitable for a customer-facing architect to use when justifying the routing choice to a technical stakeholder. After implementing the router, describe how you would extend it to handle burst traffic and graceful degradation when no endpoint can satisfy all constraints.

Expected solution outline:

- Model each endpoint with cost per 1K tokens, max QPS, supported quality tier, and p95 latency.
- Filter infeasible endpoints by quality and latency.
- Compute the estimated request cost.
- Choose the cheapest feasible endpoint while tracking capacity consumption.
- If none are feasible, apply a fallback policy such as closest-SLO or premium-to-standard downgrade, with an explicit explanation.
- Discuss tradeoffs between static heuristics and dynamic load-aware routing, including customer-visible impacts on cost and latency.
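The outline above can be sketched in Python as follows. This is a minimal illustration, not a reference solution. It makes several assumptions the question does not state: all names (`Endpoint`, `Request`, `route`, `used`) are hypothetical, each endpoint charges one blended rate for prompt and completion tokens, a premium endpoint may serve standard requests, and capacity is counted per one-second window in a dictionary that the caller resets.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TIER_RANK = {"standard": 0, "premium": 1}

@dataclass
class Endpoint:
    name: str
    cost_per_1k: float  # assumed blended price per 1K tokens
    max_qps: int        # requests accepted per one-second window
    tier: str           # "standard" or "premium"
    p95_ms: float

@dataclass
class Request:
    prompt_tokens: int
    completion_tokens: int
    slo_ms: float
    tier: str

def request_cost(req: Request, ep: Endpoint) -> float:
    """Estimated cost of serving req on ep."""
    return (req.prompt_tokens + req.completion_tokens) / 1000 * ep.cost_per_1k

def route(req: Request, endpoints: List[Endpoint],
          used: Dict[str, int]) -> Tuple[Optional[str], str]:
    """Return (endpoint name or None, explanation).

    `used` maps endpoint name to requests already routed in the current
    window; it is updated in place.
    """
    # Capacity filter: skip endpoints that are at their QPS limit.
    open_eps = [ep for ep in endpoints if used.get(ep.name, 0) < ep.max_qps]
    if not open_eps:
        return None, "Rejected: every endpoint is at its QPS limit for this window."

    def tier_ok(ep: Endpoint) -> bool:
        # Assumption: a higher tier can serve a lower-tier request.
        return TIER_RANK[ep.tier] >= TIER_RANK[req.tier]

    feasible = [ep for ep in open_eps if tier_ok(ep) and ep.p95_ms <= req.slo_ms]
    if feasible:
        # Normal path: cheapest endpoint that meets tier and latency.
        best = min(feasible, key=lambda ep: request_cost(req, ep))
        reason = (f"Routed to {best.name}: cheapest option meeting the "
                  f"{req.tier} tier and {req.slo_ms:.0f} ms SLO "
                  f"(p95 {best.p95_ms:.0f} ms, est. ${request_cost(req, best):.4f}).")
    else:
        right_tier = [ep for ep in open_eps if tier_ok(ep)]
        if right_tier:
            # Fallback 1 (closest-SLO): keep the tier, take the lowest p95.
            best = min(right_tier, key=lambda ep: ep.p95_ms)
            reason = (f"Routed to {best.name}: no endpoint meets the "
                      f"{req.slo_ms:.0f} ms SLO, so the closest p95 "
                      f"({best.p95_ms:.0f} ms) was chosen at the {req.tier} tier.")
        else:
            # Fallback 2 (downgrade): prefer endpoints that still meet the
            # SLO, then the cheapest among them.
            best = min(open_eps,
                       key=lambda ep: (ep.p95_ms > req.slo_ms, request_cost(req, ep)))
            reason = (f"Routed to {best.name}: no {req.tier} capacity is available, "
                      f"so the request was downgraded to {best.tier} "
                      f"(p95 {best.p95_ms:.0f} ms).")

    used[best.name] = used.get(best.name, 0) + 1
    return best.name, reason
```

A caller would hold one `used` dictionary per time window and clear it when the window rolls over. For burst traffic, that fixed window is the piece most likely to be replaced, for example by a token bucket or a short queue per endpoint, which is one place the static-versus-load-aware tradeoff from the outline shows up.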


