Design a multi-agent coding assistant for OpenAI that helps users debug API integrations, write small code patches, and validate outputs against tool results.

Targets: 35 QPS steady state and 100 QPS peak; P95 time-to-first-token under 1.5s and P95 task completion under 12s for standard requests; average total inference spend under $0.09 per task; all within a cluster of 24 H100s plus CPU workers for tools and sandboxing.

Describe how you would:
- decompose work across planner, retriever, code-execution, and critic agents, and decide when to use parallel vs. sequential agent calls (a minimal orchestration sketch follows this list);
- manage context windows and memory;
- design the serving stack: admission control, batching, and degradation modes during traffic spikes (see the admission-control sketch below).

Finally, specify how you would evaluate the system end to end: task success, tool-use correctness, latency/cost/quality tradeoffs, safety failure modes, and an LLM-as-judge or human-eval framework robust enough to compare single-agent and multi-agent variants before launch (a judge-harness sketch closes the section).
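For concreteness, here is a minimal orchestration sketch of the parallel-vs-sequential decision, assuming Python asyncio and stubbed agent calls; every name (`call_agent`, `Subtask`, `run_task`) is hypothetical, not part of any stated API.

```python
# Hypothetical orchestration sketch (not a prescribed design): the planner
# fans out to retriever and code-execution agents in parallel when subtasks
# are independent, then the critic runs sequentially on the merged draft.
import asyncio
from dataclasses import dataclass

@dataclass
class Subtask:
    kind: str                        # "retrieve" or "execute"
    payload: str
    needs_retrieval_first: bool = False

async def call_agent(role: str, prompt: str) -> str:
    """Stand-in for a model or tool call (e.g., a request to a serving tier)."""
    await asyncio.sleep(0.01)        # simulated latency
    return f"{role}: {prompt[:40]}"

async def run_task(user_request: str) -> str:
    plan = await call_agent("planner", user_request)
    subtasks = [Subtask("retrieve", plan), Subtask("execute", plan)]

    # Independent subtasks run concurrently; dependent ones would be sequenced.
    independent = [t for t in subtasks if not t.needs_retrieval_first]
    results = await asyncio.gather(
        *(call_agent(t.kind, t.payload) for t in independent)
    )
    draft = "\n".join(results)

    # The critic is sequential by construction: it needs the merged draft.
    verdict = await call_agent("critic", draft)
    return verdict if "accept" in verdict else draft   # toy acceptance rule

if __name__ == "__main__":
    print(asyncio.run(run_task("debug a 401 from the /v1/embeddings call")))
```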
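One plausible shape for admission control, sketched under the stated 35/100 QPS assumptions: a token bucket admits requests to the full multi-agent pipeline and routes overflow to a degraded mode rather than rejecting it. `TokenBucket` and `admit` are hypothetical names, not a prescribed implementation.

```python
# Hypothetical admission-control sketch: a token bucket sized near the 35 QPS
# steady state with burst headroom toward the 100 QPS peak. Overflow is not
# rejected; it is routed to a cheaper degraded mode (e.g., skip the critic,
# cap context length, or fall back to a single-agent path).
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

full_pipeline = TokenBucket(rate=35.0, burst=65.0)  # steady rate + peak headroom

def admit(request_id: str) -> str:
    """Return the serving mode for a request: 'full' or 'degraded'."""
    return "full" if full_pipeline.try_acquire() else "degraded"
```

Degrading instead of shedding keeps time-to-first-token bounded through the peak at a controlled quality cost; a real deployment would likely add per-tenant limits and queue-depth signals on top.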
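And a minimal LLM-as-judge harness for the single-agent vs. multi-agent comparison, assuming pairwise judging with a position swap to control for order bias; `judge` is a stub for a judge-model call and all names are hypothetical.

```python
# Hypothetical judge harness: pairwise comparison of single-agent vs.
# multi-agent outputs on the same task, with answer order swapped across two
# judge calls to control for position bias. Disagreements and ties escalate
# to human review.
import random

def judge(task: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for a judge-model call; returns 'A', 'B', or 'tie'."""
    return random.choice(["A", "B", "tie"])

def compare(task: str, single_out: str, multi_out: str) -> str:
    first = judge(task, single_out, multi_out)     # single shown in slot A
    second = judge(task, multi_out, single_out)    # slots swapped
    second_in_first_frame = {"A": "B", "B": "A"}.get(second, "tie")
    if first == second_in_first_frame and first != "tie":
        return "single" if first == "A" else "multi"
    return "human_review"                          # inconsistent or tied
```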