Context
Northstar Health uses an internal LLM assistant to help HR, IT, and managers draft policy answers from company handbooks, benefits guides, and compliance documents. Leadership is concerned that the assistant may produce biased hiring guidance, unsafe advice, or confident but unsupported answers.
Constraints
- p95 latency: ≤1,500 ms for interactive chat
- Cost ceiling: $12K/month at 300K requests/month (a budget sketch follows this list)
- Harmful or biased response rate: <1% on a red-team evaluation set
- Unsupported factual claims: <2% on a 400-question golden set
- Must not reveal PII, confidential employee data, or hidden system instructions
- Must resist prompt injection attempts in user messages and retrieved documents
- Escalate high-risk topics (termination, accommodations, protected classes, medical leave) instead of improvising (a minimal gating sketch also follows this list)
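The cost ceiling works out to $12,000 / 300,000 = $0.04 per request. A back-of-envelope check of what that buys, using illustrative per-token prices that are assumptions rather than quotes for any approved model:

```python
# Back-of-envelope check of the cost ceiling above.
# The per-token prices are illustrative placeholders, not real model pricing.

MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 300_000

per_request_budget = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS  # $0.04 per request

INPUT_PRICE_PER_TOKEN = 10 / 1_000_000    # assumed: $10 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 30 / 1_000_000   # assumed: $30 per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request at the assumed prices."""
    return input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN

# Example: 2,500 tokens of prompt + retrieved context in, 400 tokens out.
cost = request_cost(2_500, 400)  # ≈ $0.037
print(f"budget ${per_request_budget:.3f}/request, example cost ${cost:.3f}")
assert cost <= per_request_budget
```

At these assumed prices, roughly 2,500 input and 400 output tokens per call fit the budget, which argues for tight retrieval context and for routing easy queries to the smaller model.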
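For the escalation constraint, one cheap option is a deterministic gate that runs before any model call. The topic patterns and helper names below are illustrative stand-ins, not an existing API; in production the topic list would come from the policy rules engine:

```python
import re

# Hypothetical high-risk topic patterns; in production these would be
# sourced from the policy rules engine, not hard-coded.
HIGH_RISK_PATTERNS = {
    "termination": re.compile(r"\b(terminat\w+|fir(e|ed|ing)|layoff)\b", re.I),
    "accommodations": re.compile(r"\b(accommodat\w+|ADA)\b", re.I),
    "protected_classes": re.compile(r"\b(race|religion|gender|age|disability)\b", re.I),
    "medical_leave": re.compile(r"\b(FMLA|medical leave|sick leave)\b", re.I),
}

def escalation_topics(message: str) -> list[str]:
    """Return the high-risk topics a message touches, if any."""
    return [name for name, pat in HIGH_RISK_PATTERNS.items() if pat.search(message)]

def route(message: str) -> str:
    """Escalate instead of improvising when any high-risk topic is detected."""
    topics = escalation_topics(message)
    return f"escalate:{','.join(topics)}" if topics else "answer"

print(route("Can I fire someone on FMLA leave?"))  # escalate:termination,medical_leave
```

Regex gates are brittle on their own; they are best treated as a high-recall first pass in front of a trained classifier.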
Available Resources
- 25K internal policy and compliance documents with metadata (department, region, effective date, sensitivity)
- Historical chat logs with thumbs-up/down labels and 800 manually reviewed unsafe examples
- Access to an approved GPT-4-class model and a smaller low-cost model
- Existing hybrid retrieval stack (BM25 + dense search) and a policy rules engine (a fusion sketch follows this list)
- Legal and HR reviewers available for a 2-week offline evaluation sprint
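The brief doesn't say how the hybrid stack merges its two rankings; a common default is reciprocal rank fusion (RRF), sketched below with a hypothetical metadata pre-filter. The document schema and field semantics are assumptions based on the metadata listed above:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    department: str
    region: str
    sensitivity: str  # assumed values: "public", "internal", "restricted"

def allowed(doc: Doc, user_region: str) -> bool:
    """Metadata pre-filter: drop restricted docs and out-of-region policies.
    The field semantics are assumptions about the metadata schema."""
    return doc.sensitivity != "restricted" and doc.region in (user_region, "global")

def rrf(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion of two ranked doc-id lists (standard RRF, k=60)."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(allowed(Doc("doc-7", "HR", "US", "internal"), user_region="US"))  # True
print(rrf(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']
```

RRF needs no score calibration between BM25 and embedding similarities, which is why it is a common default for fusing the two.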
Task
- Design an end-to-end mitigation strategy for harmful, biased, and unsupported outputs, including prompting, retrieval controls, moderation, and escalation paths.
- Define an eval-first plan: offline safety and bias benchmarks, hallucination checks, prompt-injection testing, and online guardrail metrics after launch (a minimal scoring sketch follows this list).
- Propose the serving architecture and explain where to place classifiers, policy checks, and refusal logic while meeting latency and cost targets (a pipeline-ordering sketch also follows).
- Write a production-quality system prompt that enforces grounded, policy-safe behavior and structured outputs for risk handling (an example response envelope also follows).
- Estimate cost/latency tradeoffs and identify the most likely failure modes with concrete mitigations.
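To make the golden-set metric concrete, a minimal offline scorer might look like this; the record shape and claim-level labels are hypothetical placeholders for whatever harness gets built:

```python
# Minimal scorer for the 400-question golden set. The record shape is an
# assumption: each answer is decomposed into claims labeled by a judge.

def unsupported_claim_rate(records: list[dict]) -> float:
    """Fraction of answers containing at least one unsupported claim.

    Assumed record shape:
      {"question": str, "answer": str, "claims": [{"text": str, "supported": bool}]}
    """
    flagged = sum(
        1 for r in records if any(not c["supported"] for c in r["claims"])
    )
    return flagged / len(records)

golden = [
    {"question": "q1", "answer": "a1", "claims": [{"text": "c1", "supported": True}]},
    {"question": "q2", "answer": "a2", "claims": [{"text": "c2", "supported": False}]},
]
print(f"{unsupported_claim_rate(golden):.1%}")  # 50.0% on this toy set; the gate is <2%
```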
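On placement, one common ordering puts cheap deterministic checks first so most unsafe traffic never reaches the expensive model, with a grounding check gating output before display. Every component below is a stand-in stub, and the per-stage latencies are rough assumptions, not measurements:

```python
# Illustrative pipeline ordering; every component is a stand-in stub and the
# latency notes are rough assumptions, not measurements.

def injection_or_pii_check(msg: str) -> bool:        # small classifier, ~20 ms
    return "ignore previous instructions" in msg.lower()

def escalation_topics(msg: str) -> list[str]:        # rules engine, ~5 ms
    return ["medical_leave"] if "fmla" in msg.lower() else []

def retrieve(msg: str) -> list[str]:                 # hybrid BM25 + dense, ~150 ms
    return ["doc-123"]

def generate(msg: str, docs: list[str]) -> str:      # GPT-4-class call, ~900 ms p95
    return f"Per {docs[0]}: ..."

def grounding_check(draft: str, docs: list[str]) -> bool:  # small-model verifier, ~200 ms
    return any(d in draft for d in docs)

def handle(message: str) -> str:
    """Cheap checks first; the grounding check gates output before display."""
    if injection_or_pii_check(message):
        return "refused: input_policy"
    if topics := escalation_topics(message):
        return f"escalated: {','.join(topics)}"
    docs = retrieve(message)
    draft = generate(message, docs)
    return draft if grounding_check(draft, docs) else "refused: unsupported_claims"

print(handle("How many weeks of FMLA leave do I get?"))  # escalated: medical_leave
```

At the assumed per-stage latencies the full path sums to roughly 1,275 ms, inside the 1,500 ms p95 target with some headroom for network overhead.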
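For the structured-output requirement, one illustrative response envelope (field names are hypothetical; the real schema is part of the deliverable) makes refusal and escalation machine-checkable:

```python
from dataclasses import dataclass, field

@dataclass
class AssistantResponse:
    # Hypothetical envelope; the actual schema is part of the deliverable.
    action: str                  # "answer" | "refuse" | "escalate"
    answer: str = ""             # grounded answer text; empty unless action == "answer"
    citations: list[str] = field(default_factory=list)   # doc IDs backing each claim
    risk_topics: list[str] = field(default_factory=list) # e.g. ["medical_leave"]
    escalation_target: str = ""  # queue to route to when action == "escalate"

def validate(resp: AssistantResponse) -> None:
    """Server-side guardrail: reject malformed envelopes before display."""
    assert resp.action in {"answer", "refuse", "escalate"}
    if resp.action == "answer":
        assert resp.answer and resp.citations, "answers must cite supporting docs"
    if resp.action == "escalate":
        assert resp.escalation_target, "escalations must name a target queue"

validate(AssistantResponse(action="escalate",
                           risk_topics=["medical_leave"],
                           escalation_target="hr_benefits_queue"))
```

Validating the envelope server-side, rather than trusting the model to follow the prompt, keeps the refusal logic deterministic.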