Context
FinFlow, a B2B fintech SaaS company, runs a custom GPT-powered assistant that answers customer-support questions from account metadata, product docs, and policy articles. Customers report inconsistent response times, with some requests taking 8-15 seconds, which leads to agent drop-off and manual escalations.
Constraints
- p95 end-to-end latency target: < 2,500ms
- p99 latency target: < 5,000ms
- Cost ceiling: < $0.03 per request at 1M requests/month (under $30,000 per month in total)
- Hallucination rate: < 2% on a labeled support QA set
- Prompt injection success rate: ~0% on adversarial test prompts
- Any latency optimization must not materially reduce answer quality or violate grounding requirements
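The numeric constraints above can be turned into an automated pass/fail check. The following is a minimal sketch, not a prescribed implementation: the `RunSummary` fields, the nearest-rank percentile method, and the reading of "~0%" as "zero successful injections on the adversarial set" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation or production run."""
    latencies_ms: List[float]      # end-to-end latency per request
    cost_per_request_usd: float    # average cost per request
    hallucination_rate: float      # fraction on the labeled support QA set
    injection_successes: int       # successful attacks on the adversarial prompts


def check_constraints(run: RunSummary) -> Dict[str, bool]:
    """Return one pass/fail flag per constraint listed in the brief."""
    return {
        "p95_latency": percentile(run.latencies_ms, 95) < 2500,
        "p99_latency": percentile(run.latencies_ms, 99) < 5000,
        "cost": run.cost_per_request_usd < 0.03,
        "hallucination": run.hallucination_rate < 0.02,
        # Assumption: "~0%" is enforced as exactly zero successes.
        "prompt_injection": run.injection_successes == 0,
    }
```

Running the same check on a baseline and on each candidate change keeps latency gains and quality regressions visible in a single result, which matches the last constraint in the list.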
Available Resources
- Request traces with timestamps for: auth, retrieval, reranking, prompt assembly, LLM inference, post-processing, and guardrails
- 50K historical support conversations with user feedback and resolution outcomes
- 400 labeled evaluation prompts with expected answers, citations, and refusal cases
- Current stack: hybrid retrieval over product docs, GPT-class model for generation, optional moderation and citation checks
- Ability to change prompts, retrieval depth, model routing, caching, and asynchronous orchestration
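As an illustration of how the request traces could be used, the sketch below computes the average time spent in each stage for slow requests only. The trace shape (a dict mapping stage name to a `(start_ms, end_ms)` pair), the stage identifiers, and the 2,500ms threshold are assumptions for this example; the actual trace schema is not given here. It also assumes stages run sequentially, so summed stage time approximates end-to-end latency.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Stage names taken from the trace description; the identifiers are hypothetical.
STAGES = [
    "auth", "retrieval", "reranking", "prompt_assembly",
    "llm_inference", "post_processing", "guardrails",
]

Trace = Dict[str, Tuple[float, float]]  # stage -> (start_ms, end_ms)


def stage_durations(trace: Trace) -> Dict[str, float]:
    """Duration in ms for every known stage present in the trace."""
    return {s: trace[s][1] - trace[s][0] for s in STAGES if s in trace}


def slow_request_breakdown(traces: List[Trace], threshold_ms: float = 2500) -> Dict[str, float]:
    """Mean per-stage duration across requests whose summed stage time meets the threshold."""
    totals: Dict[str, float] = defaultdict(float)
    slow_count = 0
    for trace in traces:
        durations = stage_durations(trace)
        if sum(durations.values()) >= threshold_ms:
            slow_count += 1
            for stage, ms in durations.items():
                totals[stage] += ms
    if slow_count == 0:
        return {}
    return {stage: total / slow_count for stage, total in totals.items()}
```

Comparing this breakdown for slow requests against the same breakdown for all requests is one way to see which stage grows in the tail rather than which stage is merely largest on average.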
Task
- Propose a step-by-step debugging plan to identify the main contributors to latency across the full request path, including instrumentation you would add first.
- Define an evaluation-first approach so latency improvements are measured alongside answer quality, hallucination rate, refusal quality, and prompt-injection robustness.
- Recommend architectural and prompt-level changes to reduce latency while staying within cost and safety constraints. Be explicit about what you would test first versus later.
- Describe how you would separate issues caused by retrieval, prompt bloat, model choice, tool orchestration, network overhead, and post-processing.
- Provide a rollout and monitoring plan, including online metrics, guardrails, and rollback criteria.