LexiCare has deployed a generative AI customer-support assistant that answers insurance policy questions and drafts claim guidance for users in chat. Engagement is strong, but compliance reviewers have found unsafe and inaccurate responses in production.
| Metric | Internal Target | Current Model | Previous Model |
|---|---|---|---|
| Helpfulness score (1-5) | 4.2 | 4.3 | 4.0 |
| Factual accuracy | 95% | 88% | 84% |
| Policy compliance pass rate | 99% | 96% | 98% |
| Harmful output rate | <0.5% | 1.8% | 0.9% |
| Refusal precision | 90% | 78% | 85% |
| Refusal recall | 85% | 72% | 80% |
| Escalation to human agent | <12% | 15% | 10% |
| User satisfaction (CSAT, 1-5) | 4.4 | 4.1 | 4.0 |
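Refusal precision (the share of refusals that were actually warranted) and refusal recall (the share of requests that should have been refused and were) can be computed from labeled evaluation transcripts. A minimal sketch, assuming a simple labeled record format rather than LexiCare's actual evaluation schema:

```python
# Minimal sketch: refusal precision/recall from labeled eval records.
# Each record is (model_refused, should_refuse); the data below is
# illustrative, not drawn from LexiCare's production logs.

def refusal_metrics(records):
    """Return (precision, recall) for the refusal decision."""
    tp = sum(1 for refused, should in records if refused and should)
    fp = sum(1 for refused, should in records if refused and not should)
    fn = sum(1 for refused, should in records if not refused and should)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 12 labeled chats -> 7 correct refusals, 2 over-refusals,
# 3 missed refusals.
data = [(True, True)] * 7 + [(True, False)] * 2 + [(False, True)] * 3
p, r = refusal_metrics(data)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.78 recall=0.70
```

Low precision (over-refusing benign questions) hurts helpfulness and CSAT, while low recall (answering questions that should be refused) drives the harmful output rate, so the two must be tracked together.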
The new model is more helpful and more factually accurate than its predecessor, yet it misses nearly every internal target: the harmful output rate has doubled, policy compliance and refusal quality have regressed, and escalations exceed the threshold. Leadership wants to know whether the model is acceptable for broader rollout and what evaluation framework should be used to balance answer quality against safety risk.
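One common framework is a direction-aware release gate: every metric is compared to its internal target, and any miss blocks rollout regardless of gains elsewhere. The sketch below uses the targets and current values from the table; the gate logic, metric names, and pass/fail semantics are illustrative assumptions, not an established LexiCare process.

```python
# Hedged sketch of a release gate over the table above. Each entry maps
# a metric to (target, higher_is_better); rates are expressed as
# fractions (1.8% -> 0.018). Names are hypothetical.

TARGETS = {
    "helpfulness": (4.2, True),
    "factual_accuracy": (0.95, True),
    "policy_compliance": (0.99, True),
    "harmful_output_rate": (0.005, False),
    "refusal_precision": (0.90, True),
    "refusal_recall": (0.85, True),
    "escalation_rate": (0.12, False),
    "csat": (4.4, True),
}

CURRENT = {
    "helpfulness": 4.3,
    "factual_accuracy": 0.88,
    "policy_compliance": 0.96,
    "harmful_output_rate": 0.018,
    "refusal_precision": 0.78,
    "refusal_recall": 0.72,
    "escalation_rate": 0.15,
    "csat": 4.1,
}

def gate(metrics, targets):
    """Return the list of metrics that miss their target."""
    failures = []
    for name, (target, higher_is_better) in targets.items():
        value = metrics[name]
        ok = value >= target if higher_is_better else value <= target
        if not ok:
            failures.append(name)
    return failures

failed = gate(CURRENT, TARGETS)
print("BLOCK rollout:" if failed else "PASS", failed)
```

Under these assumed thresholds the current model fails seven of eight gates (only helpfulness passes), which makes the rollout decision explicit rather than a judgment call: either the misses are fixed, or the targets are consciously revised with compliance sign-off.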