Zero-Shot Speech-to-Speech Translation without Parallel Speech
Keywords: speech-to-speech translation, zero-shot, neural machine translation, end-to-end, monolingual data
TL;DR: RosettaSpeech is a zero-shot, end-to-end speech-to-speech translator trained only on monolingual speech-text data that sets a new zero-shot state of the art on CVSS-C.
Abstract: End-to-end speech-to-speech translation (S2ST) systems face a major bottleneck: parallel speech-to-speech data is scarce. We introduce **RosettaSpeech**, a zero-shot framework trained solely on monolingual speech-text data, augmented with machine translation supervision. Instead of cascaded pseudo-labeling, RosettaSpeech uses text as a semantic bridge to synthesize translation targets during training, removing the need for parallel speech pairs while preserving direct end-to-end inference. On the CVSS-C benchmark, RosettaSpeech achieves state-of-the-art zero-shot results, reaching ASR-BLEU 25.17 on German-to-English (+27% relative) and 29.86 on Spanish-to-English (+14% relative). Notably, it preserves the source speaker’s voice without paired speech supervision. We further study data scaling and demonstrate strong many-to-one translation performance, enabling scalable S2ST for "text-rich, speech-poor" languages.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40