Product Context
VoxCall is a voice platform that provides live speech-to-text for video meetings, call-center assist, and mobile voice input. You need to design an ML system that streams partial transcripts in real time and final transcripts at utterance end for millions of concurrent users.
Scale
| Signal | Value |
|---|---|
| DAU | 45M |
| Peak concurrent streams | 3.5M |
| Peak audio chunk QPS | 14M chunks/sec (250ms chunks) |
| Supported languages | 18 at launch |
| Average session length | 11 minutes |
| p99 partial-transcript latency | 300ms from chunk arrival |
| p99 finalization latency | 800ms after utterance end |
| Training audio corpus | 55M hours labeled + weakly labeled |
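As a sanity check, the table's throughput figures are mutually consistent; a quick back-of-envelope in Python (my arithmetic, assuming roughly one session per user per day, which the table does not state):

```python
# Back-of-envelope check of the scale numbers in the table above.
CHUNK_SEC = 0.250                          # 250 ms audio chunks
peak_streams = 3_500_000                   # peak concurrent streams
chunks_per_stream_per_sec = 1 / CHUNK_SEC  # 4 chunks/sec per stream
peak_chunk_qps = peak_streams * chunks_per_stream_per_sec
print(f"{peak_chunk_qps:,.0f} chunks/sec")  # 14,000,000 — matches the table

dau = 45_000_000
avg_session_min = 11
daily_audio_min = dau * avg_session_min     # assumes ~1 session per user per day
print(f"{daily_audio_min / 1e6:.0f}M audio minutes/day")  # 495M
```

That volume (≈495M audio minutes/day) is the denominator the cost target below is measured against.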
Task
- Clarify product requirements, user segments, and the accuracy/latency tradeoffs for live transcription.
- Design the end-to-end architecture, including streaming ingestion, candidate generation / decoding, rescoring, and final transcript assembly.
- Propose model choices for each stage and explain online vs batch components, feature storage, and adaptation for speaker/language/domain variation.
- Define the training pipeline, labeling strategy, and how logs flow back into retraining while avoiding training-serving skew.
- Specify offline and online evaluation, rollout strategy, and monitoring for drift, regressions, and outages.
- Identify major failure modes and mitigations at scale.
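The partial/final split in the latency targets implies a specific streaming contract: every chunk may trigger a revisable partial, and utterance end triggers an immutable final. A minimal sketch of that contract, with a stub decoder standing in for a streaming ASR model (all names here are illustrative, not a real VoxCall API):

```python
# Sketch of the partial/final transcript contract implied by the latency targets.
# `decode_chunk` and `utterance_ended` are stubs; in production these would be a
# streaming ASR model and a voice-activity / endpoint detector.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    is_final: bool      # partials may be revised later; finals are immutable
    utterance_id: int

def stream_transcripts(chunks, decode_chunk, utterance_ended):
    """Yield a revisable partial per chunk and a final at each utterance end."""
    utt_id, hypothesis = 0, []
    for chunk in chunks:
        hypothesis.append(decode_chunk(chunk))
        yield Transcript(" ".join(hypothesis), is_final=False, utterance_id=utt_id)
        if utterance_ended(chunk):          # e.g., endpoint detector fires on silence
            yield Transcript(" ".join(hypothesis), is_final=True, utterance_id=utt_id)
            utt_id, hypothesis = utt_id + 1, []

# Toy usage: pretend each "chunk" already carries its decoded word.
out = list(stream_transcripts(
    chunks=["hello", "world", "<eos>"],
    decode_chunk=lambda c: c,
    utterance_ended=lambda c: c == "<eos>",
))
```

The design point the sketch makes concrete: partials trade accuracy for the 300ms budget, while finals can spend the larger 800ms budget on rescoring before being frozen.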
Constraints
- Partial transcripts must appear quickly enough for live captions; accuracy can improve in later revisions.
- GPU capacity is limited, so the system must mix CPU and GPU inference efficiently.
- Some enterprise customers require data residency by region and opt out of storing raw audio beyond 24 hours.
- The system must support code-switching, background noise, and new proper nouns with limited labeled data.
- Cost target: average inference cost below $0.0025 per audio minute.
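Combining the cost target with the Scale table gives the budget envelope any architecture must fit. A rough check (my arithmetic; assumes one 11-minute session per DAU per day, which the brief does not state):

```python
# Rough cost-budget check against the $0.0025/min target, using the DAU and
# session-length figures from the Scale table.
cost_per_min = 0.0025
daily_min = 45_000_000 * 11                 # ~495M audio minutes/day
daily_budget = daily_min * cost_per_min
print(f"${daily_budget:,.0f}/day ceiling")  # $1,237,500/day

# Per 250 ms chunk (240 chunks per audio minute), the budget is very tight,
# which motivates the CPU/GPU mix and batched GPU decoding.
per_chunk = cost_per_min / (60 / 0.25)
print(f"${per_chunk:.2e} per chunk")
```

At roughly a thousandth of a cent per chunk, per-request GPU invocation is off the table; the constraint effectively forces heavy batching and keeping cheap stages (VAD, feature extraction) on CPU.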