Product Context
VoxCall is a voice platform that provides live speech-to-text for video meetings, call-center assist, and mobile voice input. You need to design an ML system that streams partial transcripts in real time and final transcripts at utterance end for millions of concurrent users.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak concurrent streams | 3.5M |
| Peak audio chunk QPS | 14M chunks/sec (250ms chunks) |
| Supported languages | 18 at launch |
| Average session length | 11 minutes |
| p99 partial-transcript latency | 300ms from chunk arrival |
| p99 finalization latency | 800ms after utterance end |
| Training audio corpus | 55M hours labeled + weakly labeled |
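As a sanity check, the table's throughput figures are mutually consistent; a quick back-of-envelope in Python (my arithmetic, assuming roughly one session per user per day, which the table does not state):

```python
# Back-of-envelope check of the scale numbers in the table above.
CHUNK_SEC = 0.250                          # 250 ms audio chunks
peak_streams = 3_500_000                   # peak concurrent streams
chunks_per_stream_per_sec = 1 / CHUNK_SEC  # 4 chunks/sec per stream
peak_chunk_qps = peak_streams * chunks_per_stream_per_sec
print(f"{peak_chunk_qps:,.0f} chunks/sec")  # 14,000,000 — matches the table

dau = 45_000_000
avg_session_min = 11
daily_audio_min = dau * avg_session_min     # assumes ~1 session per user per day
print(f"{daily_audio_min / 1e6:.0f}M audio minutes/day")  # 495M
```

That volume (≈495M audio minutes/day) is the denominator the cost target below is measured against.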
Task
- Clarify product requirements, user segments, and the accuracy/latency tradeoffs for live transcription.
- Design the end-to-end architecture, including streaming ingestion, candidate generation / decoding, rescoring, and final transcript assembly.
- Propose model choices for each stage and explain online vs batch components, feature storage, and adaptation for speaker/language/domain variation.
- Define the training pipeline, labeling strategy, and how logs flow back into retraining while avoiding training-serving skew.
- Specify offline and online evaluation, rollout strategy, and monitoring for drift, regressions, and outages.
- Identify major failure modes and mitigations at scale.
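The partial/final split in the latency targets implies a specific streaming contract: every chunk may trigger a revisable partial, and utterance end triggers an immutable final. A minimal sketch of that contract, with a stub decoder standing in for a streaming ASR model (all names here are illustrative, not a real VoxCall API):

```python
# Sketch of the partial/final transcript contract implied by the latency targets.
# `decode_chunk` and `utterance_ended` are stubs; in production these would be a
# streaming ASR model and a voice-activity / endpoint detector.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    is_final: bool      # partials may be revised later; finals are immutable
    utterance_id: int

def stream_transcripts(chunks, decode_chunk, utterance_ended):
    """Yield a revisable partial per chunk and a final at each utterance end."""
    utt_id, hypothesis = 0, []
    for chunk in chunks:
        hypothesis.append(decode_chunk(chunk))
        yield Transcript(" ".join(hypothesis), is_final=False, utterance_id=utt_id)
        if utterance_ended(chunk):          # e.g., endpoint detector fires on silence
            yield Transcript(" ".join(hypothesis), is_final=True, utterance_id=utt_id)
            utt_id, hypothesis = utt_id + 1, []

# Toy usage: pretend each "chunk" already carries its decoded word.
out = list(stream_transcripts(
    chunks=["hello", "world", "<eos>"],
    decode_chunk=lambda c: c,
    utterance_ended=lambda c: c == "<eos>",
))
```

The design point the sketch makes concrete: partials trade accuracy for the 300ms budget, while finals can spend the larger 800ms budget on rescoring before being frozen.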
Constraints
- Partial transcripts must appear quickly enough for live captions; accuracy can improve in later revisions.
- GPU capacity is limited, so the system must mix CPU and GPU inference efficiently.
- Some enterprise customers require data residency by region and opt out of storing raw audio beyond 24 hours.
- The system must support code-switching, background noise, and new proper nouns with limited labeled data.
- Cost target: average inference cost below $0.0025 per audio minute.
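Combining the cost target with the Scale table gives the budget envelope any architecture must fit. A rough check (my arithmetic; assumes one 11-minute session per DAU per day, which the brief does not state):

```python
# Rough cost-budget check against the $0.0025/min target, using the DAU and
# session-length figures from the Scale table.
cost_per_min = 0.0025
daily_min = 45_000_000 * 11                 # ~495M audio minutes/day
daily_budget = daily_min * cost_per_min
print(f"${daily_budget:,.0f}/day ceiling")  # $1,237,500/day

# Per 250 ms chunk (240 chunks per audio minute), the budget is very tight,
# which motivates the CPU/GPU mix and batched GPU decoding.
per_chunk = cost_per_min / (60 / 0.25)
print(f"${per_chunk:.2e} per chunk")
```

At roughly a thousandth of a cent per chunk, per-request GPU invocation is off the table; the constraint effectively forces heavy batching and keeping cheap stages (VAD, feature extraction) on CPU.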