Context
BrightDesk uses an LLM assistant to draft customer-support replies from a shared prompt template plus tenant-specific account data. The prompt performs well for most enterprise customers, but one large customer reports frequent wrong answers, unnecessary refusals, and occasional policy violations.
Constraints
- p95 latency: 1,500ms end-to-end
- Cost ceiling: $0.015 per request, with 2M requests/month
- Hallucination rate: <2% on a customer-segmented golden set
- Prompt-injection success rate: <1% on adversarial tests
- No customer data may leak across tenants; logs must be redactable for PII review
- Do not jump straight to fine-tuning a new model; fine-tune only if evals show prompt-only fixes are insufficient
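The constraints above can be encoded as a small config object so the eval harness asserts against them programmatically rather than by convention. This is an illustrative sketch; the class and field names (`SloTargets`, `within_slo`) are assumptions, not part of BrightDesk's codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SloTargets:
    """Targets from the Constraints section; names are illustrative."""
    p95_latency_ms: int = 1500
    cost_per_request_usd: float = 0.015
    monthly_requests: int = 2_000_000
    max_hallucination_rate: float = 0.02   # <2% on the golden set
    max_injection_success_rate: float = 0.01  # <1% on adversarial tests

    @property
    def monthly_cost_ceiling_usd(self) -> float:
        # $0.015/request * 2M requests/month = $30,000/month
        return self.cost_per_request_usd * self.monthly_requests

def within_slo(p95_ms: float, cost_usd: float, halluc_rate: float,
               injection_rate: float, slo: SloTargets = SloTargets()) -> bool:
    """True only if every measured metric meets its target."""
    return (p95_ms <= slo.p95_latency_ms
            and cost_usd <= slo.cost_per_request_usd
            and halluc_rate < slo.max_hallucination_rate
            and injection_rate < slo.max_injection_success_rate)
```

A candidate prompt or serving change that fails `within_slo` on any slice is rejected before rollout, regardless of how much it improves Customer B.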
Available Resources
- Current system prompt, developer prompt, and 30 days of anonymized request/response logs
- Per-customer metadata: industry, supported products, policy pack, tone settings, locale, and average input length
- 1,200 labeled conversations across 8 customers, including failure labels: hallucination, missed instruction, bad refusal, formatting error, unsafe answer
- Approved models: GPT-4.1 mini for production, GPT-4.1 for evaluation and targeted retries
- Optional retrieval of tenant-specific policy snippets and product docs
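A first use of the 1,200 labeled conversations is to compare the failure-label mix per customer: if Customer B's failures cluster on one label, that points at a specific cause (e.g. mostly bad refusals suggests a policy-pack conflict; mostly hallucinations suggests retrieval gaps). A minimal sketch, assuming a hypothetical record schema of `{"customer_id": ..., "labels": [...]}` with labels normalized to snake_case:

```python
from collections import Counter
from typing import Iterable

# The five failure labels from the dataset, normalized to snake_case
FAILURE_LABELS = {"hallucination", "missed_instruction", "bad_refusal",
                  "formatting_error", "unsafe_answer"}

def failure_mix(conversations: Iterable[dict], customer_id: str) -> Counter:
    """Count failure labels across one customer's labeled conversations."""
    counts: Counter = Counter()
    for conv in conversations:
        if conv["customer_id"] != customer_id:
            continue
        counts.update(label for label in conv["labels"]
                      if label in FAILURE_LABELS)
    return counts
```

Diffing `failure_mix(data, "A")` against `failure_mix(data, "B")` is the cheapest isolation step: it uses only existing labels and requires no model calls.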
Task
- Propose a step-by-step troubleshooting plan for a prompt that succeeds for Customer A but fails for Customer B. Your plan should isolate whether the issue comes from prompt wording, customer-specific context, retrieval quality, formatting, model limits, or safety conflicts.
- Define an evaluation-first workflow: what offline slices, adversarial tests, and online metrics you would use before changing architecture.
- Design an improved prompt and serving strategy that preserves latency/cost targets while reducing customer-specific failures.
- Explain how you would detect and mitigate prompt injection, overfitting to one customer, and hidden regressions for other customers.
- Provide implementation details for a reproducible debugging harness in Python, including structured outputs for failure analysis.
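The last task item asks for a reproducible Python debugging harness with structured outputs. One possible shape, shown here as a sketch: each golden-set case runs through a model callable and a grader, and the result is emitted as a structured record keyed by a prompt hash so runs are diffable across prompt variants. `call_model` and `case["grade"]` are stand-ins for the real model client and grader, not an existing API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class CaseResult:
    """One structured row of failure-analysis output (schema is illustrative)."""
    case_id: str
    customer_id: str
    prompt_hash: str   # traces which prompt variant produced this run
    label: str         # "pass" or one of the five failure labels
    latency_ms: float

def run_case(case: dict, prompt: str, call_model) -> CaseResult:
    """Run one golden-set case and grade the reply.

    Assumed contracts: call_model(prompt, case) -> (reply, latency_ms),
    and case["grade"](reply) -> label.
    """
    reply, latency_ms = call_model(prompt, case)
    return CaseResult(
        case_id=case["id"],
        customer_id=case["customer_id"],
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:12],
        label=case["grade"](reply),
        latency_ms=latency_ms,
    )

def write_results(results, path: str) -> None:
    # JSONL keeps each run append-only, diffable, and easy to slice later
    with open(path, "w") as f:
        for r in results:
            f.write(json.dumps(asdict(r)) + "\n")
```

Because results carry `customer_id` and `prompt_hash`, the same JSONL file supports both per-customer slicing (is B still failing?) and regression checks (did the new prompt hurt A?).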