Business Context
VoxDesk, a customer support platform, uses automatic speech recognition (ASR) to transcribe inbound call recordings before downstream analytics and agent-assist workflows. The NLP team wants a practical evaluation pipeline that measures transcript quality and flags calls where recognition errors are likely to break intent detection and keyword extraction.
Data
- Volume: 180,000 English call recordings with human reference transcripts; ~25,000 new calls per day
- Input: 8 kHz mono telephony audio and ASR-generated transcripts
- Text length: 1-40 utterances per call; median transcript length is 95 tokens
- Domain: customer support, billing, cancellations, address changes, payment failures
- Label distribution: three quality labels derived from WER bands: Good (58%), Review (29%), Poor (13%); a banding sketch follows this list
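Once WER is computed, the label derivation is mechanical. A minimal sketch, assuming the jiwer library and illustrative cutoffs; the brief does not state the actual band boundaries, so these thresholds are placeholders:

```python
import jiwer

# Hypothetical cutoffs: the real Good/Review/Poor boundaries would come
# from VoxDesk's labeling guidelines, not from this sketch.
GOOD_MAX_WER = 0.10
REVIEW_MAX_WER = 0.25

def quality_band(reference: str, hypothesis: str) -> str:
    """Map a call's word error rate onto a quality label."""
    wer = jiwer.wer(reference, hypothesis)
    if wer <= GOOD_MAX_WER:
        return "Good"
    if wer <= REVIEW_MAX_WER:
        return "Review"
    return "Poor"

print(quality_band("my billing address changed", "my building address changed"))
# -> "Review" under these cutoffs (1 substitution / 4 words = 0.25 WER)
```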
Success Criteria
A strong solution should achieve macro-F1 >= 0.84 on transcript quality classification, recall >= 0.92 on the Poor class, and produce call-level diagnostics that help operations teams understand common ASR failure modes.
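As a concrete acceptance check, both thresholds can be verified with scikit-learn; `y_true` and `y_pred` below are placeholders for held-out labels and model predictions:

```python
from sklearn.metrics import f1_score, recall_score

LABELS = ["Good", "Review", "Poor"]

def meets_success_criteria(y_true, y_pred) -> bool:
    """Check macro-F1 >= 0.84 overall and recall >= 0.92 on Poor."""
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    per_class_recall = recall_score(y_true, y_pred, labels=LABELS, average=None)
    poor_recall = per_class_recall[LABELS.index("Poor")]
    return macro_f1 >= 0.84 and poor_recall >= 0.92
```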
Constraints
- Inference must run in <150 ms per transcript for batch QA scoring (timing sketch after this list)
- Solution must fit on a single T4 GPU or CPU-only fallback
- No external API calls; all processing must run in a secure environment
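A rough way to verify the latency budget and the GPU/CPU fallback, assuming a transformers text-classification pipeline. The model name is a stand-in for the eventual quality classifier, and in the secure environment weights would be loaded from a local path rather than downloaded:

```python
import time
import torch
from transformers import pipeline

# Use the T4 if present, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
    device=device,
)

transcript = "yeah hi um i was charged twice on my card last month"
clf(transcript)  # warm up so model load time is excluded

start = time.perf_counter()
for _ in range(100):
    clf(transcript)
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"mean latency: {elapsed_ms:.1f} ms per transcript")
```

For the daily QA sweep, passing a list of transcripts to the pipeline in batches would raise throughput well beyond this single-example figure.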
Requirements
- Build an NLP pipeline that classifies ASR transcripts into Good / Review / Poor quality bands (classifier sketch after this list).
- Use realistic preprocessing for conversational transcripts, including disfluencies, timestamps, and ASR artifacts (cleanup sketch below).
- Implement a modern Python solution using transformers and standard evaluation tooling.
- Report metrics by class and explain how you would handle class imbalance (weighting sketch below).
- Describe how transcript quality errors would affect downstream NLP tasks such as intent classification and entity extraction.
- Include an error analysis plan for numbers, names, accents, and domain-specific vocabulary (error-slicing sketch below).
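For the classification requirement, a minimal inference sketch with transformers; "voxdesk/quality-distilbert" is a hypothetical local checkpoint fine-tuned elsewhere, not a published model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["Good", "Review", "Poor"]
tok = AutoTokenizer.from_pretrained("voxdesk/quality-distilbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "voxdesk/quality-distilbert", num_labels=3
)
model.eval()

def classify(transcript: str) -> str:
    """Return the predicted quality band for one cleaned transcript."""
    inputs = tok(transcript, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```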
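For preprocessing, one illustrative cleanup pass; the timestamp, disfluency, and artifact patterns are assumptions about VoxDesk's transcript format:

```python
import re

TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")        # e.g. [00:12] or [1:02:33]
DISFLUENCY = re.compile(r"\b(?:um+|uh+|erm+|hmm+)\b", re.IGNORECASE)
ARTIFACT = re.compile(r"<(?:noise|silence|inaudible)>")         # assumed ASR tag tokens

def clean_transcript(text: str) -> str:
    """Strip timestamps, artifact tags, and filler words, then normalize spaces."""
    text = TIMESTAMP.sub(" ", text)
    text = ARTIFACT.sub(" ", text)
    text = DISFLUENCY.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("[00:12] um hi <noise> i uh need to cancel my plan"))
# -> "hi i need to cancel my plan"
```

Whether disfluencies are stripped or kept is a modeling choice: for quality scoring they can themselves carry signal, so removal should be configurable rather than hard-coded.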
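For per-class reporting and imbalance, scikit-learn's classification_report covers the reporting half, and inverse-frequency class weights in the loss are one standard remedy; the counts below mirror the stated 58/29/13 split:

```python
import numpy as np
import torch
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

LABELS = ["Good", "Review", "Poor"]

# Per-band precision/recall/F1 plus the macro average:
# print(classification_report(y_true, y_pred, labels=LABELS))

# Weight the cross-entropy loss so the 13% Poor class is not drowned out.
y_train = np.array([0] * 58 + [1] * 29 + [2] * 13)  # stand-in for real labels
weights = compute_class_weight("balanced", classes=np.array([0, 1, 2]), y=y_train)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```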
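For the error analysis plan, recognition errors can be bucketed by category once reference and hypothesis words are aligned. The heuristics and term list below are illustrative assumptions; accent-related slicing would additionally require speaker metadata, since it is not recoverable from text alone:

```python
import re

DOMAIN_TERMS = {"billing", "cancellation", "autopay", "chargeback"}  # assumed lexicon

def slice_of(word: str) -> str:
    """Assign a reference word to an error-analysis bucket."""
    if re.fullmatch(r"[\d$.,:/-]+", word):
        return "numbers"
    if word.lower() in DOMAIN_TERMS:
        return "domain vocabulary"
    if word[:1].isupper():  # crude proper-name heuristic
        return "names"
    return "other"

def tally(errors):
    """Count substitution/deletion errors per bucket. Each error is a
    (reference_word, hypothesis_word_or_None) pair; in practice the pairs
    would come from an alignment such as jiwer.process_words."""
    counts = {}
    for ref_word, _ in errors:
        key = slice_of(ref_word)
        counts[key] = counts.get(key, 0) + 1
    return counts

print(tally([("$42.50", "forty"), ("Hernandez", None), ("autopay", "auto pay")]))
# -> {'numbers': 1, 'names': 1, 'domain vocabulary': 1}
```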