VoxNote is evaluating two production language systems: an automatic speech recognition (ASR) model for meeting transcripts and an NLP summarization model that generates meeting recaps. Product complaints rose after a recent model refresh: transcript quality degraded on noisy calls, and summaries became shorter while omitting action items.
| Model | Metric | Baseline | Current | Change |
|---|---|---|---|---|
| ASR | Word Error Rate (WER) | 11.8% | 16.4% | +4.6 pts |
| ASR | Sentence Error Rate (SER) | 21.0% | 29.7% | +8.7 pts |
| ASR | Named Entity Error Rate | 9.5% | 18.2% | +8.7 pts |
| ASR | Real-time factor (lower is faster) | 0.72 | 0.61 | -0.11 (improved) |
| Summarization | ROUGE-1 | 0.462 | 0.438 | -0.024 |
| Summarization | ROUGE-L | 0.401 | 0.356 | -0.045 |
| Summarization | BLEU | 0.214 | 0.191 | -0.023 |
| Summarization | Human factuality score (1-5) | 4.3 | 3.6 | -0.7 |
| Summarization | Action item recall | 0.81 | 0.63 | -0.18 |
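For context on the headline ASR metric above: WER is the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, divided by the number of reference words. A minimal illustrative sketch (not VoxNote's actual evaluation pipeline, which would typically use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") and one deletion ("by") over 5 reference words:
print(wer("send the report by friday", "send a report friday"))  # 0.4
```

Note that because WER normalizes by reference length, it can exceed 100% on very noisy audio, and a fixed number of errors hurts short utterances more than long ones.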
Leadership wants to know which metrics should be used to evaluate each system, how to interpret the current results, and whether the refresh should be rolled back or improved in place.
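On the summarization side, ROUGE-1 (reported in the table) measures unigram overlap with a reference summary; the recall variant is the fraction of reference unigrams that the generated summary recovers, which is the same counting logic underlying the action-item-recall number. A minimal sketch for illustration (production scoring would use the `rouge-score` package, which also stems and aggregates over multiple references):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: clipped unigram overlap over reference unigram count."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clip each word's overlap at its count in the candidate.
    overlap = sum(min(c, cand_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

# 3 of 5 reference unigrams ("alice", "the", "deck") appear in the candidate:
print(rouge1_recall("alice will send the deck", "alice sends the deck"))  # 0.6
```

A caveat relevant to the table: n-gram overlap rewards lexical similarity, not correctness, so a modest ROUGE drop (-0.024) can coexist with much larger losses in human factuality (-0.7) and action item recall (-0.18); the latter two are the better signals for this regression.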