A B2B SaaS company wants to deploy an OpenAI-powered support copilot for 8,000 agents worldwide. The system must handle 120 QPS at steady state and 300 QPS during incident spikes, with P95 end-to-end latency under 700 ms for answer suggestions and 99.9% availability. Their CFO requires combined inference and retrieval spend to stay below $180k/month, and only a small platform team can be dedicated to operations. Walk through the architecture you would propose on top of the OpenAI APIs and the surrounding stack (retrieval, caching, prompt routing, batching, observability, fallback paths), and explain the tradeoffs you would make among model quality, latency, and cost.
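Before sketching an architecture, it helps to see what the stated budget actually allows per request. The back-of-the-envelope check below uses only the figures given in the prompt, plus two labeled assumptions that are not in the prompt: a 30-day month and round-the-clock traffic at the steady-state rate, with one retrieval pass and one model call per request.

```python
# Budget sanity check derived from the figures in the scenario above.
# Assumptions (not stated in the prompt): 30-day month, traffic holds at the
# steady-state rate around the clock, one retrieval + one inference call per request.

STEADY_QPS = 120
MONTHLY_BUDGET_USD = 180_000

SECONDS_PER_MONTH = 30 * 24 * 3600                    # ~2.59M seconds
requests_per_month = STEADY_QPS * SECONDS_PER_MONTH   # ~311M requests

budget_per_request = MONTHLY_BUDGET_USD / requests_per_month
print(f"Requests per month (steady state): {requests_per_month:,.0f}")
print(f"All-in budget per request:         ${budget_per_request:.5f} "
      f"(~{budget_per_request * 100:.3f} cents)")
```

Under these assumptions the combined retrieval-plus-generation budget is roughly $0.0006 per request, which is the constraint that makes the caching, prompt-routing, and batching elements listed in the prompt central to any viable answer rather than optional optimizations.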