VoxNote is evaluating two production language systems: an automatic speech recognition (ASR) model for meeting transcripts and an NLP summarization model that generates meeting recaps. Product complaints rose after a recent model refresh: transcript quality degraded on noisy calls, and summaries became shorter while omitting action items.
| Model | Metric | Baseline | Current | Change |
|---|---|---|---|---|
| ASR | Word Error Rate (WER) | 11.8% | 16.4% | +4.6 pts |
| ASR | Sentence Error Rate (SER) | 21.0% | 29.7% | +8.7 pts |
| ASR | Named Entity Error Rate | 9.5% | 18.2% | +8.7 pts |
| ASR | Real-time factor (lower is faster) | 0.72 | 0.61 | -0.11 (improved) |
| Summarization | ROUGE-1 | 0.462 | 0.438 | -0.024 |
| Summarization | ROUGE-L | 0.401 | 0.356 | -0.045 |
| Summarization | BLEU | 0.214 | 0.191 | -0.023 |
| Summarization | Human factuality score (1-5) | 4.3 | 3.6 | -0.7 |
| Summarization | Action item recall | 0.81 | 0.63 | -0.18 |
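For context on the headline ASR metric above: WER is the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, divided by the number of reference words. A minimal illustrative sketch (not VoxNote's actual evaluation pipeline, which would typically use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") and one deletion ("by") over 5 reference words:
print(wer("send the report by friday", "send a report friday"))  # 0.4
```

Note that because WER normalizes by reference length, it can exceed 100% on very noisy audio, and a fixed number of errors hurts short utterances more than long ones.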
Leadership wants to know which metrics should be used to evaluate each system, how to interpret the current results, and whether the refresh should be rolled back or improved in place.
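On the summarization side, ROUGE-1 (reported in the table) measures unigram overlap with a reference summary; the recall variant is the fraction of reference unigrams that the generated summary recovers, which is the same counting logic underlying the action-item-recall number. A minimal sketch for illustration (production scoring would use the `rouge-score` package, which also stems and aggregates over multiple references):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: clipped unigram overlap over reference unigram count."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clip each word's overlap at its count in the candidate.
    overlap = sum(min(c, cand_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

# 3 of 5 reference unigrams ("alice", "the", "deck") appear in the candidate:
print(rouge1_recall("alice will send the deck", "alice sends the deck"))  # 0.6
```

A caveat relevant to the table: n-gram overlap rewards lexical similarity, not correctness, so a modest ROUGE drop (-0.024) can coexist with much larger losses in human factuality (-0.7) and action item recall (-0.18); the latter two are the better signals for this regression.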