Real-time voice translation looked amazing in demos, but in practice it struggled with accents, technical jargon, and context. The demos were clearly run in controlled environments, with clear speakers and simple topics.
The reason? Training data bias and the "last mile" problem: demos use ideal conditions, while real usage involves messy audio, overlapping speech, and domain-specific vocabulary the models never saw during training.