LexiCare has deployed a generative AI customer-support assistant that answers insurance policy questions and drafts claim guidance for users in chat. Engagement is strong, but compliance reviewers have found unsafe and inaccurate responses in production.
| Metric | Internal Target | Current Model | Previous Model |
|---|---|---|---|
| Helpfulness score (1-5) | 4.2 | 4.3 | 4.0 |
| Factual accuracy | 95% | 88% | 84% |
| Policy compliance pass rate | 99% | 96% | 98% |
| Harmful output rate | <0.5% | 1.8% | 0.9% |
| Refusal precision | 90% | 78% | 85% |
| Refusal recall | 85% | 72% | 80% |
| Escalation to human agent | <12% | 15% | 10% |
| User satisfaction (CSAT, 1-5) | 4.4 | 4.1 | 4.0 |
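Refusal precision (the share of refusals that were actually warranted) and refusal recall (the share of requests that should have been refused and were) can be computed from labeled evaluation transcripts. A minimal sketch, assuming a simple labeled record format rather than LexiCare's actual evaluation schema:

```python
# Minimal sketch: refusal precision/recall from labeled eval records.
# Each record is (model_refused, should_refuse); the data below is
# illustrative, not drawn from LexiCare's production logs.

def refusal_metrics(records):
    """Return (precision, recall) for the refusal decision."""
    tp = sum(1 for refused, should in records if refused and should)
    fp = sum(1 for refused, should in records if refused and not should)
    fn = sum(1 for refused, should in records if not refused and should)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 12 labeled chats -> 7 correct refusals, 2 over-refusals,
# 3 missed refusals.
data = [(True, True)] * 7 + [(True, False)] * 2 + [(False, True)] * 3
p, r = refusal_metrics(data)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.78 recall=0.70
```

Low precision (over-refusing benign questions) hurts helpfulness and CSAT, while low recall (answering questions that should be refused) drives the harmful output rate, so the two must be tracked together.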
The new model is more helpful and more factually accurate than its predecessor, yet it misses nearly every internal target: the harmful output rate has doubled, policy compliance and refusal quality have regressed, and escalations exceed the threshold. Leadership wants to know whether the model is acceptable for broader rollout and what evaluation framework should be used to balance answer quality against safety risk.
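One common framework is a direction-aware release gate: every metric is compared to its internal target, and any miss blocks rollout regardless of gains elsewhere. The sketch below uses the targets and current values from the table; the gate logic, metric names, and pass/fail semantics are illustrative assumptions, not an established LexiCare process.

```python
# Hedged sketch of a release gate over the table above. Each entry maps
# a metric to (target, higher_is_better); rates are expressed as
# fractions (1.8% -> 0.018). Names are hypothetical.

TARGETS = {
    "helpfulness": (4.2, True),
    "factual_accuracy": (0.95, True),
    "policy_compliance": (0.99, True),
    "harmful_output_rate": (0.005, False),
    "refusal_precision": (0.90, True),
    "refusal_recall": (0.85, True),
    "escalation_rate": (0.12, False),
    "csat": (4.4, True),
}

CURRENT = {
    "helpfulness": 4.3,
    "factual_accuracy": 0.88,
    "policy_compliance": 0.96,
    "harmful_output_rate": 0.018,
    "refusal_precision": 0.78,
    "refusal_recall": 0.72,
    "escalation_rate": 0.15,
    "csat": 4.1,
}

def gate(metrics, targets):
    """Return the list of metrics that miss their target."""
    failures = []
    for name, (target, higher_is_better) in targets.items():
        value = metrics[name]
        ok = value >= target if higher_is_better else value <= target
        if not ok:
            failures.append(name)
    return failures

failed = gate(CURRENT, TARGETS)
print("BLOCK rollout:" if failed else "PASS", failed)
```

Under these assumed thresholds the current model fails seven of eight gates (only helpfulness passes), which makes the rollout decision explicit rather than a judgment call: either the misses are fixed, or the targets are consciously revised with compliance sign-off.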