BrightAssist is preparing to deploy a customer-support generative AI model that answers billing and account questions in its fintech app. In offline evaluation the model produces fluent responses, but the trust and safety team found cases of incorrect financial advice, policy violations, and unsafe escalation handling.
| Metric | Current Model | Target | Notes |
|---|---|---|---|
| Helpfulness pass rate | 78% | ≥ 85% | Human-rated on 1,200 prompts |
| Factual accuracy | 74% | ≥ 90% | Grounded against internal KB |
| Policy safety pass rate | 96.2% | ≥ 99.0% | Includes harmful/regulated content checks |
| Hallucination rate | 11.5% | ≤ 5.0% | Unsupported claims in final answer |
| Escalation recall | 68% | ≥ 90% | Cases that should be handed to a human |
| Over-refusal rate | 14% | ≤ 7% | Safe requests incorrectly declined |
| Avg. response latency | 2.1s | ≤ 2.5s | Within SLA |
Leadership wants a practical evaluation framework that measures both answer quality and safety before launch. The current metrics suggest the model is usable for simple requests but unreliable in high-risk situations, especially where escalation or factual grounding is required.
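One way to make the launch criteria concrete is a small gating check over the table above. The sketch below is a minimal, illustrative harness, not BrightAssist's actual framework: the metric values come from the table, but the metric names, direction flags, and threshold logic are assumptions for the example.

```python
# Minimal launch-readiness check over the evaluation metrics above.
# Values are taken from the metrics table; the gating logic itself is
# an illustrative assumption, not a real production framework.

# metric name -> (current, target, higher_is_better)
METRICS = {
    "helpfulness_pass_rate":  (0.78,  0.85,  True),
    "factual_accuracy":       (0.74,  0.90,  True),
    "policy_safety_pass_rate": (0.962, 0.990, True),
    "hallucination_rate":     (0.115, 0.050, False),
    "escalation_recall":      (0.68,  0.90,  True),
    "over_refusal_rate":      (0.14,  0.07,  False),
    "avg_latency_s":          (2.1,   2.5,   False),
}

def gaps(metrics):
    """Return the metrics that miss their target, with the shortfall size."""
    failing = {}
    for name, (current, target, higher_is_better) in metrics.items():
        if higher_is_better and current < target:
            failing[name] = target - current
        elif not higher_is_better and current > target:
            failing[name] = current - target
    return failing

def ready_to_launch(metrics):
    """Launch only when every metric meets its target."""
    return not gaps(metrics)

if __name__ == "__main__":
    # Report the worst shortfalls first.
    for name, shortfall in sorted(gaps(METRICS).items(), key=lambda kv: -kv[1]):
        print(f"{name}: misses target by {shortfall:.3f}")
    print("launch ready:", ready_to_launch(METRICS))
```

On the current numbers this check fails on every metric except latency, which matches the narrative: the model meets its serving SLA but is not yet reliable enough on quality and safety to ship.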