Zero-Shot Speech-to-Speech Translation without Parallel Speech

Zhisheng Zheng; David Harwath

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0

Keywords: speech-to-speech translation, zero-shot, neural machine translation, end-to-end, monolingual data

TL;DR: RosettaSpeech is a zero-shot, end-to-end speech-to-speech translator trained only on monolingual speech-text that sets new SOTA on CVSS.

Abstract: End-to-end speech-to-speech translation (S2ST) systems face a major bottleneck: parallel speech-to-speech data is scarce. We introduce **RosettaSpeech**, a zero-shot framework trained solely on monolingual speech-text data, augmented with machine translation supervision. Instead of cascaded pseudo-labeling, RosettaSpeech uses text as a semantic bridge to synthesize translation targets during training, removing the need for parallel speech pairs while preserving direct end-to-end inference. On the CVSS-C benchmark, RosettaSpeech achieves state-of-the-art zero-shot results, reaching ASR-BLEU 25.17 on German-to-English (+27% relative) and 29.86 on Spanish-to-English (+14% relative). Notably, it preserves the source speaker’s voice without paired speech supervision. We further study data scaling and demonstrate strong many-to-one translation performance, enabling scalable S2ST for "text-rich, speech-poor" languages.
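For readers who want to unpack the relative gains quoted in the abstract, the following is a minimal arithmetic sketch, not part of the authors' method. It assumes the percentages are relative improvements over a prior best ASR-BLEU score and back-calculates that implied score. The function names `relative_gain` and `implied_baseline` are hypothetical, and because the reported percentages are rounded, the implied baselines are approximate rather than reported results.

```python
def relative_gain(new_score: float, old_score: float) -> float:
    """Relative improvement of new_score over old_score, as a fraction."""
    return (new_score - old_score) / old_score


def implied_baseline(new_score: float, gain: float) -> float:
    """Invert relative_gain: the old score implied by a new score and a gain."""
    return new_score / (1.0 + gain)


# ASR-BLEU scores and relative gains as stated in the abstract.
# The gains are rounded, so the derived baselines are only approximate.
reported = {
    "German-to-English": (25.17, 0.27),
    "Spanish-to-English": (29.86, 0.14),
}

for pair, (score, gain) in reported.items():
    base = implied_baseline(score, gain)
    print(f"{pair}: ASR-BLEU {score:.2f}, implied prior best ~{base:.2f}")
```

The inversion uses `new / (1 + gain)` rather than `new * (1 - gain)`; the latter is a common slip that understates the baseline when gains are large.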

Submission Number: 40
