Write code for a simplified batching layer in front of an LLM inference server. Incoming requests have an arrival time, a token-length estimate, and a hard latency deadline; the server can process batches up to a maximum total token budget. Implement a scheduler that groups requests into batches to maximize throughput while ensuring no request whose deadline could still be met is unnecessarily dropped. The function should output the batch assignments and any dropped requests, and you should explain the tradeoff your heuristic makes between fairness, latency, and utilization.
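A minimal sketch of one way such a scheduler could look, in Python. It assumes a fixed per-token service rate (`tokens_per_sec`), a server that is idle at time 0, and batches that execute back to back; the `Request` dataclass, the `schedule_batches` function, and the earliest-deadline-first packing heuristic are all illustrative choices, not a prescribed design:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: str
    arrival: float   # arrival time in seconds
    tokens: int      # estimated token length
    deadline: float  # hard latency deadline (absolute time in seconds)


def schedule_batches(
    requests: List[Request],
    token_budget: int,
    tokens_per_sec: float = 1000.0,  # assumed service rate for the sketch
) -> Tuple[List[List[Request]], List[Request]]:
    """Greedy earliest-deadline-first (EDF) batching.

    Requests are sorted by deadline and packed into batches of at most
    `token_budget` tokens.  A request is dropped only once even a dedicated
    batch starting at the current simulated time could not meet its deadline,
    so no still-feasible request is dropped unnecessarily.
    Returns (batches, dropped).
    """
    pending = sorted(requests, key=lambda r: r.deadline)
    batches: List[List[Request]] = []
    dropped: List[Request] = []
    clock = 0.0  # simulated server time; batches are assumed to run back to back

    while pending:
        batch: List[Request] = []
        batch_tokens = 0
        batch_start = clock
        min_deadline = float("inf")  # tightest deadline already admitted
        leftover: List[Request] = []

        for req in pending:
            new_start = max(batch_start, req.arrival)  # batch waits for latest arrival
            new_tokens = batch_tokens + req.tokens
            new_finish = new_start + new_tokens / tokens_per_sec
            fits = new_tokens <= token_budget
            meets_all = new_finish <= min(min_deadline, req.deadline)

            if fits and meets_all:
                batch.append(req)
                batch_tokens = new_tokens
                batch_start = new_start
                min_deadline = min(min_deadline, req.deadline)
            elif max(clock, req.arrival) + req.tokens / tokens_per_sec > req.deadline:
                # Even a dedicated batch starting now would miss the deadline,
                # so dropping is unavoidable rather than premature.
                dropped.append(req)
            else:
                leftover.append(req)  # still feasible; defer to a later batch

        if not batch:
            # Nothing could be scheduled; whatever remains either exceeds the
            # budget on its own or can no longer meet its deadline.
            dropped.extend(leftover)
            break

        batches.append(batch)
        clock = batch_start + batch_tokens / tokens_per_sec
        pending = leftover

    return batches, dropped
```

A quick usage example under the same assumptions:

```python
reqs = [
    Request("a", arrival=0.0, tokens=400, deadline=1.0),
    Request("b", arrival=0.0, tokens=700, deadline=2.5),
    Request("c", arrival=0.2, tokens=300, deadline=0.6),
]
batches, dropped = schedule_batches(reqs, token_budget=1000)
```

The tradeoff this heuristic makes: EDF ordering prioritizes latency, since the tightest deadlines are served first and a batch closes early rather than risk a deadline miss, which can leave the token budget underfilled and cost utilization; packing largest-token requests first would raise utilization but risks starving small, urgent requests. Fairness is only indirect here, through deadline order rather than arrival order, so a request with a loose deadline can be repeatedly deferred as long as it remains feasible.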