Context
FinPilot runs an LLM-powered support copilot for 4,000 agents handling billing, disputes, and policy questions. Usage has grown to 2.5M requests/day across chat summarization, grounded Q&A, and a small tool-using workflow; leadership wants tighter cost controls without losing answer quality or observability.
Constraints
- p95 latency: ≤1,800 ms for Q&A, ≤2,500 ms for tool-using flows
- Monthly inference budget: $180K at current volume
- Hallucination rate for grounded answers: <2% on a 1,000-query golden set
- Prompt-injection success rate: <0.5% on adversarial tests
- Must support per-tenant cost attribution, rate limiting, and incident debugging
- PII must be redacted from logs; prompts and outputs must remain auditable
Available Resources
- Historical request logs with prompt, model, token counts, latency, user feedback, and escalation outcome
- 120K internal help-center articles and policy docs, already permissioned by tenant
- Approved models: GPT-4.1-mini, GPT-4.1, and a cheaper summarization model
- Existing OpenTelemetry pipeline, Redis cache, and vector store with BM25 + dense retrieval
- 20 support leads available to label the golden set and review failures
Task
- Design an observability and cost-control strategy for large-scale LLM inference, including what to log, aggregate, alert on, and expose in dashboards (an illustrative per-request log record is sketched after this list).
- Propose an eval-first rollout plan: offline evaluation against the golden and adversarial sets first, then online monitoring and experimentation for routing, caching, and fallback policies (see the offline-gate sketch below).
- Define an inference architecture that balances model routing, retrieval, caching, and tool use under the latency and budget constraints (see the inference-path sketch below).
- Specify safeguards for hallucination, prompt injection, PII leakage, and runaway agent/tool loops; the inference-path sketch shows one way to bound tool loops.
- Estimate token, latency, and monthly cost impacts for your design, and explain the main tradeoffs (the back-of-envelope per-request budget is worked out in the last sketch below).
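Illustrative Sketches (non-normative)
The sketches below use Python and invented names to make the task bullets concrete; they are not a prescribed design. First, one possible per-request log record for the observability and cost-attribution work: the fields are drawn from the signals listed under Available Resources (model, token counts, latency, user feedback, escalation outcome) plus the per-tenant attribution and PII-redaction constraints, and every field name is an assumption.
```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class RequestLogRecord:
    """Illustrative per-request record; field names are assumptions, not a fixed schema."""
    request_id: str            # correlates with an OpenTelemetry trace for incident debugging
    tenant_id: str             # per-tenant cost attribution and rate limiting
    route: str                 # "summarization" | "grounded_qa" | "tool_flow"
    model: str                 # e.g. "gpt-4.1-mini" vs. "gpt-4.1"
    prompt_redacted: str       # PII stripped before logging; the raw prompt stays auditable elsewhere
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float            # token counts multiplied by the model's per-token price
    latency_ms: int            # compared against the 1,800 ms / 2,500 ms p95 ceilings
    cache_hit: bool
    retrieval_doc_ids: list    # which help-center articles grounded the answer
    user_feedback: Optional[str] = None
    escalated: bool = False

def emit(record: RequestLogRecord) -> None:
    # Stand-in for the existing OpenTelemetry exporter: one JSON line per request.
    print(json.dumps(asdict(record)))
```
Aggregating cost_usd and latency_ms by tenant_id, model, and route gives the dashboard and alerting dimensions the first task bullet asks for.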
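Next, a sketch of the offline gate an eval-first rollout might run before any routing, caching, or fallback change ships. The function names and dataset shapes are assumptions; the thresholds come directly from the constraints (<2% hallucination on the 1,000-query golden set, <0.5% prompt-injection success).
```python
from typing import Callable

def offline_gate(
    golden_set: list,            # grounded-Q&A cases labeled by the support leads
    adversarial_set: list,       # prompt-injection test cases
    candidate: Callable,         # the routing/caching/model configuration under test
    is_hallucination: Callable,  # judge: (answer, case) -> bool
    injection_succeeded: Callable,
) -> bool:
    """Return True only if the candidate configuration clears both offline thresholds."""
    halluc_rate = sum(is_hallucination(candidate(c), c) for c in golden_set) / len(golden_set)
    injection_rate = sum(injection_succeeded(candidate(c), c) for c in adversarial_set) / len(adversarial_set)
    print(f"hallucination={halluc_rate:.2%}  prompt-injection={injection_rate:.2%}")
    return halluc_rate < 0.02 and injection_rate < 0.005
```
Only configurations that pass this gate would move on to online monitoring and experimentation.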
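For the inference architecture and the runaway-loop safeguard, a minimal sketch of one possible request path: an exact-match Redis cache check, a cheap-versus-strong routing heuristic over the approved models, and a tool loop bounded by iteration and token caps. The cache keying, routing rule, TTL, cap values, and the generate/call_model/call_tool hooks are all placeholder assumptions (it also assumes the redis-py client and a locally reachable Redis), chosen only to make the shape concrete.
```python
import hashlib
import redis  # assumes the redis-py client and that the existing Redis cache is reachable locally

CHEAP_MODEL, STRONG_MODEL = "gpt-4.1-mini", "gpt-4.1"
MAX_TOOL_STEPS = 4        # runaway-loop guard: hard iteration cap (placeholder value)
MAX_TOOL_TOKENS = 8_000   # runaway-loop guard: per-request token budget (placeholder value)

r = redis.Redis(host="localhost", port=6379)

def cache_key(tenant_id: str, prompt: str) -> str:
    # Exact-match keying per tenant; semantic caching would be a separate experiment.
    return f"llm:{tenant_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"

def pick_model(route: str, retrieval_score: float) -> str:
    # Illustrative heuristic: keep summarization and well-grounded Q&A on the cheaper
    # model, escalate tool flows and weakly grounded queries to the stronger one.
    if route == "tool_flow" or retrieval_score < 0.5:
        return STRONG_MODEL
    return CHEAP_MODEL

def answer(tenant_id: str, prompt: str, route: str, retrieval_score: float, generate) -> str:
    # 'generate' is a hypothetical model-client hook: (model, prompt) -> completion text.
    key = cache_key(tenant_id, prompt)
    cached = r.get(key)
    if cached:
        return cached.decode()                 # cache hit: no model call, no inference cost
    model = pick_model(route, retrieval_score)
    completion = generate(model, prompt)
    r.setex(key, 3600, completion)             # 1-hour TTL, illustrative
    return completion

def run_tool_loop(call_model, call_tool, messages: list) -> str:
    # call_model / call_tool are hypothetical hooks into the model and tool runtime.
    tokens_used = 0
    for _ in range(MAX_TOOL_STEPS):
        step = call_model(messages)            # returns {"type", "text" or "tool_call", "tokens"}
        tokens_used += step["tokens"]
        if step["type"] == "final":
            return step["text"]
        if tokens_used > MAX_TOOL_TOKENS:
            break                              # token budget exhausted: stop paying, fail closed
        messages.append(call_tool(step["tool_call"]))
    return "Escalating to a human agent."      # bounded exit instead of an unbounded loop
```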
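Finally, the back-of-envelope per-request budget implied by the stated volume and budget, assuming a 30-day month; the example token counts are placeholders to show the shape of the estimate, not measured values.
```python
REQUESTS_PER_DAY = 2_500_000
MONTHLY_BUDGET_USD = 180_000

requests_per_month = REQUESTS_PER_DAY * 30                        # 75M requests/month
budget_per_request = MONTHLY_BUDGET_USD / requests_per_month
print(f"average budget per request: ${budget_per_request:.4f}")   # ≈ $0.0024

# If a typical grounded-Q&A call used, say, 1,500 prompt + 300 completion tokens
# (placeholder numbers), the blended price would have to stay at or below roughly:
print(f"max blended price per 1K tokens: ${budget_per_request / 1.8:.4f}")  # ≈ $0.0013
```
At an average of about a quarter of a cent per request, the cache-hit rate and the share of traffic served by GPT-4.1-mini rather than GPT-4.1 dominate the cost tradeoffs the last task bullet asks about.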