Context
BrightDesk uses an LLM assistant to draft customer-support replies from a shared prompt template plus tenant-specific account data. The prompt performs well for most enterprise customers, but one large customer reports frequent wrong answers, unnecessary refusals, and occasional policy violations.
Constraints
- p95 latency: 1,500ms end-to-end
- Cost ceiling: $0.015 per request, with 2M requests/month
- Hallucination rate: <2% on a customer-segmented golden set
- Prompt-injection success rate: <1% on adversarial tests
- No customer data may leak across tenants; logs must be redactable for PII review
- Do not jump straight to fine-tuning a new model; fine-tune only if evals show prompt-only fixes are insufficient
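The constraints above can be encoded as a small config object so the eval harness asserts against them programmatically rather than by convention. This is an illustrative sketch; the class and field names (`SloTargets`, `within_slo`) are assumptions, not part of BrightDesk's codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SloTargets:
    """Targets from the Constraints section; names are illustrative."""
    p95_latency_ms: int = 1500
    cost_per_request_usd: float = 0.015
    monthly_requests: int = 2_000_000
    max_hallucination_rate: float = 0.02   # <2% on the golden set
    max_injection_success_rate: float = 0.01  # <1% on adversarial tests

    @property
    def monthly_cost_ceiling_usd(self) -> float:
        # $0.015/request * 2M requests/month = $30,000/month
        return self.cost_per_request_usd * self.monthly_requests

def within_slo(p95_ms: float, cost_usd: float, halluc_rate: float,
               injection_rate: float, slo: SloTargets = SloTargets()) -> bool:
    """True only if every measured metric meets its target."""
    return (p95_ms <= slo.p95_latency_ms
            and cost_usd <= slo.cost_per_request_usd
            and halluc_rate < slo.max_hallucination_rate
            and injection_rate < slo.max_injection_success_rate)
```

A candidate prompt or serving change that fails `within_slo` on any slice is rejected before rollout, regardless of how much it improves Customer B.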
Available Resources
- Current system prompt, developer prompt, and 30 days of anonymized request/response logs
- Per-customer metadata: industry, supported products, policy pack, tone settings, locale, and average input length
- 1,200 labeled conversations across 8 customers, including failure labels: hallucination, missed instruction, bad refusal, formatting error, unsafe answer
- Approved models: GPT-4.1 mini for production, GPT-4.1 for evaluation and targeted retries
- Optional retrieval of tenant-specific policy snippets and product docs
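A first use of the 1,200 labeled conversations is to compare the failure-label mix per customer: if Customer B's failures cluster on one label, that points at a specific cause (e.g. mostly bad refusals suggests a policy-pack conflict; mostly hallucinations suggests retrieval gaps). A minimal sketch, assuming a hypothetical record schema of `{"customer_id": ..., "labels": [...]}` with labels normalized to snake_case:

```python
from collections import Counter
from typing import Iterable

# The five failure labels from the dataset, normalized to snake_case
FAILURE_LABELS = {"hallucination", "missed_instruction", "bad_refusal",
                  "formatting_error", "unsafe_answer"}

def failure_mix(conversations: Iterable[dict], customer_id: str) -> Counter:
    """Count failure labels across one customer's labeled conversations."""
    counts: Counter = Counter()
    for conv in conversations:
        if conv["customer_id"] != customer_id:
            continue
        counts.update(label for label in conv["labels"]
                      if label in FAILURE_LABELS)
    return counts
```

Diffing `failure_mix(data, "A")` against `failure_mix(data, "B")` is the cheapest isolation step: it uses only existing labels and requires no model calls.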
Task
- Propose a step-by-step troubleshooting plan for a prompt that succeeds for Customer A but fails for Customer B. Your plan should isolate whether the issue comes from prompt wording, customer-specific context, retrieval quality, formatting, model limits, or safety conflicts.
- Define an evaluation-first workflow: what offline slices, adversarial tests, and online metrics you would use before changing architecture.
- Design an improved prompt and serving strategy that preserves latency/cost targets while reducing customer-specific failures.
- Explain how you would detect and mitigate prompt injection, overfitting to one customer, and hidden regressions for other customers.
- Provide implementation details for a reproducible debugging harness in Python, including structured outputs for failure analysis.
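The last task item asks for a reproducible Python debugging harness with structured outputs. One possible shape, shown here as a sketch: each golden-set case runs through a model callable and a grader, and the result is emitted as a structured record keyed by a prompt hash so runs are diffable across prompt variants. `call_model` and `case["grade"]` are stand-ins for the real model client and grader, not an existing API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class CaseResult:
    """One structured row of failure-analysis output (schema is illustrative)."""
    case_id: str
    customer_id: str
    prompt_hash: str   # traces which prompt variant produced this run
    label: str         # "pass" or one of the five failure labels
    latency_ms: float

def run_case(case: dict, prompt: str, call_model) -> CaseResult:
    """Run one golden-set case and grade the reply.

    Assumed contracts: call_model(prompt, case) -> (reply, latency_ms),
    and case["grade"](reply) -> label.
    """
    reply, latency_ms = call_model(prompt, case)
    return CaseResult(
        case_id=case["id"],
        customer_id=case["customer_id"],
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:12],
        label=case["grade"](reply),
        latency_ms=latency_ms,
    )

def write_results(results, path: str) -> None:
    # JSONL keeps each run append-only, diffable, and easy to slice later
    with open(path, "w") as f:
        for r in results:
            f.write(json.dumps(asdict(r)) + "\n")
```

Because results carry `customer_id` and `prompt_hash`, the same JSONL file supports both per-customer slicing (is B still failing?) and regression checks (did the new prompt hurt A?).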