Business Context
VoxDesk, a customer support platform, uses automatic speech recognition (ASR) to transcribe inbound call recordings before downstream analytics and agent-assist workflows. The NLP team wants a practical evaluation pipeline that measures transcript quality and flags calls where recognition errors are likely to break intent detection and keyword extraction.
Data
- Volume: 180,000 English call recordings with human reference transcripts; ~25,000 new calls per day
- Input: 8 kHz mono telephony audio and ASR-generated transcripts
- Text length: 1-40 utterances per call; median transcript length is 95 tokens
- Domain: customer support, billing, cancellations, address changes, payment failures
- Label distribution: three quality labels derived from WER bands: Good (58%), Review (29%), Poor (13%); a banding sketch follows this list
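Once WER is computed, the label derivation is mechanical. A minimal sketch, assuming the jiwer library and illustrative cutoffs; the brief does not state the actual band boundaries, so these thresholds are placeholders:

```python
import jiwer

# Hypothetical cutoffs: the real Good/Review/Poor boundaries would come
# from VoxDesk's labeling guidelines, not from this sketch.
GOOD_MAX_WER = 0.10
REVIEW_MAX_WER = 0.25

def quality_band(reference: str, hypothesis: str) -> str:
    """Map a call's word error rate onto a quality label."""
    wer = jiwer.wer(reference, hypothesis)
    if wer <= GOOD_MAX_WER:
        return "Good"
    if wer <= REVIEW_MAX_WER:
        return "Review"
    return "Poor"

print(quality_band("my billing address changed", "my building address changed"))
# -> "Review" under these cutoffs (1 substitution / 4 words = 0.25 WER)
```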
Success Criteria
A strong solution should achieve macro-F1 >= 0.84 on transcript quality classification, recall >= 0.92 on the Poor class, and produce call-level diagnostics that help operations teams understand common ASR failure modes.
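As a concrete acceptance check, both thresholds can be verified with scikit-learn; `y_true` and `y_pred` below are placeholders for held-out labels and model predictions:

```python
from sklearn.metrics import f1_score, recall_score

LABELS = ["Good", "Review", "Poor"]

def meets_success_criteria(y_true, y_pred) -> bool:
    """Check macro-F1 >= 0.84 overall and recall >= 0.92 on Poor."""
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    per_class_recall = recall_score(y_true, y_pred, labels=LABELS, average=None)
    poor_recall = per_class_recall[LABELS.index("Poor")]
    return macro_f1 >= 0.84 and poor_recall >= 0.92
```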
Constraints
- Inference must run in <150 ms per transcript for batch QA scoring (timing sketch after this list)
- Solution must fit on a single T4 GPU or CPU-only fallback
- No external API calls; all processing must run in a secure environment
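A rough way to verify the latency budget and the GPU/CPU fallback, assuming a transformers text-classification pipeline. The model name is a stand-in for the eventual quality classifier, and in the secure environment weights would be loaded from a local path rather than downloaded:

```python
import time
import torch
from transformers import pipeline

# Use the T4 if present, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
    device=device,
)

transcript = "yeah hi um i was charged twice on my card last month"
clf(transcript)  # warm up so model load time is excluded

start = time.perf_counter()
for _ in range(100):
    clf(transcript)
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"mean latency: {elapsed_ms:.1f} ms per transcript")
```

For the daily QA sweep, passing a list of transcripts to the pipeline in batches would raise throughput well beyond this single-example figure.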
Requirements
- Build an NLP pipeline that classifies ASR transcripts into Good / Review / Poor quality bands (classifier sketch after this list).
- Use realistic preprocessing for conversational transcripts, including disfluencies, timestamps, and ASR artifacts (cleanup sketch below).
- Implement a modern Python solution using transformers and standard evaluation tooling.
- Report metrics by class and explain how you would handle class imbalance (weighting sketch below).
- Describe how transcript quality errors would affect downstream NLP tasks such as intent classification and entity extraction.
- Include an error analysis plan for numbers, names, accents, and domain-specific vocabulary (error-slicing sketch below).
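For the classification requirement, a minimal inference sketch with transformers; "voxdesk/quality-distilbert" is a hypothetical local checkpoint fine-tuned elsewhere, not a published model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["Good", "Review", "Poor"]
tok = AutoTokenizer.from_pretrained("voxdesk/quality-distilbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "voxdesk/quality-distilbert", num_labels=3
)
model.eval()

def classify(transcript: str) -> str:
    """Return the predicted quality band for one cleaned transcript."""
    inputs = tok(transcript, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```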
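For preprocessing, one illustrative cleanup pass; the timestamp, disfluency, and artifact patterns are assumptions about VoxDesk's transcript format:

```python
import re

TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")        # e.g. [00:12] or [1:02:33]
DISFLUENCY = re.compile(r"\b(?:um+|uh+|erm+|hmm+)\b", re.IGNORECASE)
ARTIFACT = re.compile(r"<(?:noise|silence|inaudible)>")         # assumed ASR tag tokens

def clean_transcript(text: str) -> str:
    """Strip timestamps, artifact tags, and filler words, then normalize spaces."""
    text = TIMESTAMP.sub(" ", text)
    text = ARTIFACT.sub(" ", text)
    text = DISFLUENCY.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("[00:12] um hi <noise> i uh need to cancel my plan"))
# -> "hi i need to cancel my plan"
```

Whether disfluencies are stripped or kept is a modeling choice: for quality scoring they can themselves carry signal, so removal should be configurable rather than hard-coded.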
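For per-class reporting and imbalance, scikit-learn's classification_report covers the reporting half, and inverse-frequency class weights in the loss are one standard remedy; the counts below mirror the stated 58/29/13 split:

```python
import numpy as np
import torch
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

LABELS = ["Good", "Review", "Poor"]

# Per-band precision/recall/F1 plus the macro average:
# print(classification_report(y_true, y_pred, labels=LABELS))

# Weight the cross-entropy loss so the 13% Poor class is not drowned out.
y_train = np.array([0] * 58 + [1] * 29 + [2] * 13)  # stand-in for real labels
weights = compute_class_weight("balanced", classes=np.array([0, 1, 2]), y=y_train)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```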
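For the error analysis plan, recognition errors can be bucketed by category once reference and hypothesis words are aligned. The heuristics and term list below are illustrative assumptions; accent-related slicing would additionally require speaker metadata, since it is not recoverable from text alone:

```python
import re

DOMAIN_TERMS = {"billing", "cancellation", "autopay", "chargeback"}  # assumed lexicon

def slice_of(word: str) -> str:
    """Assign a reference word to an error-analysis bucket."""
    if re.fullmatch(r"[\d$.,:/-]+", word):
        return "numbers"
    if word.lower() in DOMAIN_TERMS:
        return "domain vocabulary"
    if word[:1].isupper():  # crude proper-name heuristic
        return "names"
    return "other"

def tally(errors):
    """Count substitution/deletion errors per bucket. Each error is a
    (reference_word, hypothesis_word_or_None) pair; in practice the pairs
    would come from an alignment such as jiwer.process_words."""
    counts = {}
    for ref_word, _ in errors:
        key = slice_of(ref_word)
        counts[key] = counts.get(key, 0) + 1
    return counts

print(tally([("$42.50", "forty"), ("Hernandez", None), ("autopay", "auto pay")]))
# -> {'numbers': 1, 'names': 1, 'domain vocabulary': 1}
```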