Cut LLM Cost Without Quality Loss

Context

BrightDesk runs an AI assistant that drafts replies for customer-support agents using help-center articles, policy docs, and recent ticket history. Leadership wants to reduce LLM spend materially because usage has grown faster than expected, but support quality cannot regress.

Constraints

Current volume: 1.2M draft generations/month
Current average cost: $0.018 per request
Target: reduce total LLM cost by at least 50%
p95 latency must stay under 1,800ms
Hallucination rate must remain below 2.5% on a labeled support set
Prompt-injection success rate must be below 1% on adversarial tests
Escalation-to-human rate cannot worsen by more than 1 percentage point after launch

Available Resources

80K historical support tickets with agent-written final responses
6K help-center and policy documents, already permission-scoped
Existing baseline: a single large model call with a long prompt containing ticket text, account metadata, and top-10 retrieved docs
Access to one large model, one mid-tier model, and one small model from an approved provider
Existing evaluation assets: 800 labeled prompts with rubric scores for correctness, groundedness, tone, and policy compliance

Task

Propose a cost-reduction plan for the drafting system. Be specific about what changed: prompt compression, retrieval changes, model routing, caching, structured outputs, or a smaller model for some requests.
Define an evaluation-first rollout plan. Specify offline and online metrics, acceptance thresholds, and how you would prove quality was not harmed.
Design the prompt and serving architecture, including how you keep hallucination and prompt injection risk low while shrinking token usage.
Estimate cost and latency before vs. after your changes, including assumptions about token counts, routing percentages, and monthly volume.
Identify the main failure modes of your optimization plan and how you would detect and mitigate them in production.

Constraints

Current volume: 1.2M draft generations/month

Current average cost: $0.018 per request

Target: reduce total LLM cost by at least 50%

p95 latency must stay under 1,800ms

Hallucination rate must remain below 2.5% on a labeled support set

Prompt-injection success rate must be below 1% on adversarial tests

Escalation-to-human rate cannot worsen by more than 1 percentage point after launch

Available Resources

80K historical support tickets with agent-written final responses

6K help-center and policy documents, already permission-scoped

Existing baseline: a single large model call with a long prompt containing ticket text, account metadata, and top-10 retrieved docs

Access to one large model, one mid-tier model, and one small model from an approved provider

Existing evaluation assets: 800 labeled prompts with rubric scores for correctness, groundedness, tone, and policy compliance

Task

Propose a cost-reduction plan for the drafting system. Be specific about what changed: prompt compression, retrieval changes, model routing, caching, structured outputs, or a smaller model for some requests.

Define an evaluation-first rollout plan. Specify offline and online metrics, acceptance thresholds, and how you would prove quality was not harmed.

Design the prompt and serving architecture, including how you keep hallucination and prompt injection risk low while shrinking token usage.

Estimate cost and latency before vs. after your changes, including assumptions about token counts, routing percentages, and monthly volume.

Identify the main failure modes of your optimization plan and how you would detect and mitigate them in production.

Constraints

Current volume: 1.2M draft generations/month

Current average cost: $0.018 per request

Target: reduce total LLM cost by at least 50%

p95 latency must stay under 1,800ms

Hallucination rate must remain below 2.5% on a labeled support set

Prompt-injection success rate must be below 1% on adversarial tests

Escalation-to-human rate cannot worsen by more than 1 percentage point after launch

Available Resources

80K historical support tickets with agent-written final responses

6K help-center and policy documents, already permission-scoped

Existing baseline: a single large model call with a long prompt containing ticket text, account metadata, and top-10 retrieved docs

Access to one large model, one mid-tier model, and one small model from an approved provider

Existing evaluation assets: 800 labeled prompts with rubric scores for correctness, groundedness, tone, and policy compliance

Task

Define an evaluation-first rollout plan. Specify offline and online metrics, acceptance thresholds, and how you would prove quality was not harmed.

Design the prompt and serving architecture, including how you keep hallucination and prompt injection risk low while shrinking token usage.

Estimate cost and latency before vs. after your changes, including assumptions about token counts, routing percentages, and monthly volume.

Identify the main failure modes of your optimization plan and how you would detect and mitigate them in production.

Constraints

Current volume: 1.2M draft generations/month

Current average cost: $0.018 per request

Target: reduce total LLM cost by at least 50%

p95 latency must stay under 1,800ms

Hallucination rate must remain below 2.5% on a labeled support set

Prompt-injection success rate must be below 1% on adversarial tests

Escalation-to-human rate cannot worsen by more than 1 percentage point after launch

Available Resources

80K historical support tickets with agent-written final responses

6K help-center and policy documents, already permission-scoped

Existing baseline: a single large model call with a long prompt containing ticket text, account metadata, and top-10 retrieved docs

Access to one large model, one mid-tier model, and one small model from an approved provider

Existing evaluation assets: 800 labeled prompts with rubric scores for correctness, groundedness, tone, and policy compliance

Task

Define an evaluation-first rollout plan. Specify offline and online metrics, acceptance thresholds, and how you would prove quality was not harmed.

Design the prompt and serving architecture, including how you keep hallucination and prompt injection risk low while shrinking token usage.

Estimate cost and latency before vs. after your changes, including assumptions about token counts, routing percentages, and monthly volume.

Identify the main failure modes of your optimization plan and how you would detect and mitigate them in production.

Interview Guides

Context

Constraints

Available Resources

Task

Cut LLM Cost Without Quality Loss

Context

Constraints

Available Resources

Task

Your Answer

Cut LLM Cost Without Quality Loss

Context

Constraints

Available Resources

Task

Cut LLM Cost Without Quality Loss

Context

Constraints

Available Resources

Task

Your Answer