Debug High-Latency Custom GPT

Context

FinFlow, a B2B fintech SaaS company, runs a custom GPT-powered assistant that answers customer-support questions from account metadata, product docs, and policy articles. Customers report that response times have become inconsistent, with some requests taking 8-15 seconds, which causes agent drop-off and manual escalations.

Constraints

p95 end-to-end latency target: < 2,500ms
p99 latency target: < 5,000ms
Cost ceiling: < $0.03 per request at 1M requests/month
Hallucination rate: < 2% on a labeled support QA set
Prompt injection success rate: ~0% on adversarial test prompts
Any latency optimization must not materially reduce answer quality or violate grounding requirements
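
To make the latency and cost targets concrete, here is a minimal sketch of how they might be checked against a batch of measured requests. The function names and the nearest-rank percentile method are assumptions chosen for illustration; the thresholds are the ones listed above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_targets(latencies_ms, p95_target_ms=2500, p99_target_ms=5000):
    """True when both the p95 and p99 end-to-end targets are met (strictly below)."""
    return (
        percentile(latencies_ms, 95) < p95_target_ms
        and percentile(latencies_ms, 99) < p99_target_ms
    )


def monthly_cost_ceiling(cost_per_request=0.03, requests_per_month=1_000_000):
    """Upper bound on monthly spend implied by the per-request cost ceiling."""
    return cost_per_request * requests_per_month


# Hypothetical sample: mostly fast requests with a slow tail.
sample = [1000] * 90 + [2000] * 5 + [4000] * 4 + [12000]
print(meets_latency_targets(sample))
print(monthly_cost_ceiling())
```

Note that a single 12-second outlier in 100 requests does not break either percentile target in this sample, which is why the tail should also be tracked separately (for example, a maximum or p99.9) when debugging the 8-15 second reports.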

Available Resources

Request traces with timestamps for: auth, retrieval, reranking, prompt assembly, LLM inference, post-processing, and guardrails
50K historical support conversations with user feedback and resolution outcomes
400 labeled evaluation prompts with expected answers, citations, and refusal cases
Current stack: hybrid retrieval over product docs, GPT-class model for generation, optional moderation and citation checks
Ability to change prompts, retrieval depth, model routing, caching, and asynchronous orchestration
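
As a starting point for working with the request traces, the sketch below shows one way to compute each stage's share of time among slow requests. The trace layout (a dict mapping stage name to a start and end timestamp in milliseconds) is an assumption; the real trace schema is not given here. The stage names mirror the ones listed above.

```python
from collections import defaultdict

# Stage names taken from the trace description; the underscore spelling is an assumption.
STAGES = [
    "auth", "retrieval", "reranking", "prompt_assembly",
    "llm_inference", "post_processing", "guardrails",
]


def stage_durations(trace):
    """Map each stage to its duration in ms, given stage -> (start_ms, end_ms)."""
    return {stage: end - start for stage, (start, end) in trace.items()}


def slow_request_breakdown(traces, threshold_ms=2500):
    """Average fraction of summed stage time spent in each stage, over slow requests only.

    A request counts as slow when its summed stage time exceeds threshold_ms.
    Summed stage time ignores gaps between stages and any parallel execution,
    so it only approximates end-to-end latency.
    """
    share_totals = defaultdict(float)
    slow_count = 0
    for trace in traces:
        durations = stage_durations(trace)
        total = sum(durations.values())
        if total <= threshold_ms:
            continue
        slow_count += 1
        for stage, duration in durations.items():
            share_totals[stage] += duration / total
    if slow_count == 0:
        return {}
    return {stage: share / slow_count for stage, share in share_totals.items()}


# Hypothetical traces: one fast request and one slow request dominated by inference.
traces = [
    {"auth": (0, 50), "retrieval": (50, 300), "llm_inference": (300, 1500)},
    {"auth": (0, 100), "retrieval": (100, 1000), "llm_inference": (1000, 10000)},
]
for stage, share in slow_request_breakdown(traces).items():
    print(f"{stage}: {share:.0%}")
```

Averaging per-request shares rather than raw milliseconds keeps a few extreme requests from dominating the picture; comparing the same breakdown for fast requests helps show which stage actually changes in the tail.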

Task

Propose a step-by-step debugging plan to identify the main contributors to latency across the full request path, including instrumentation you would add first.
Define an evaluation-first approach so latency improvements are measured alongside answer quality, hallucination rate, refusal quality, and prompt-injection robustness.
Recommend architectural and prompt-level changes to reduce latency while staying within cost and safety constraints. Be explicit about what you would test first versus later.
Describe how you would separate issues caused by retrieval, prompt bloat, model choice, tool orchestration, network overhead, and post-processing.
Provide a rollout and monitoring plan, including online metrics, guardrails, and rollback criteria.
