High-Fidelity Simultaneous Speech-To-Speech Translation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: natural, high-quality simultaneous speech translation with voice preservation
Abstract: We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart --where one waits for the end of the source utterance to start translating-- adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on *huggingface.co/spaces/kyutai/hibiki-samples* as well as models and inference code at *github.com/kyutai-labs/hibiki*.
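The delay-identification idea from the abstract can be sketched in a few lines. Below is a minimal illustration, not the paper's implementation: it assumes a Hugging Face MarianMT checkpoint (*Helsinki-NLP/opus-mt-fr-en*) as a stand-in for the off-the-shelf text translation system, searches per target token rather than per word, and uses a hypothetical log-probability slack `tau` in place of the paper's actual criterion. For each target token it finds the earliest source prefix under which the translator scores that token almost as well as under the full source, which is the delay signal used to build aligned synthetic data.

```python
# Minimal sketch of the per-word delay search described in the abstract.
# Assumptions (not from the paper): MarianMT "Helsinki-NLP/opus-mt-fr-en"
# stands in for the off-the-shelf translation system, the search runs per
# target token rather than per word, and tau is an illustrative threshold.
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-fr-en"
tok = MarianTokenizer.from_pretrained(name)
mt = MarianMTModel.from_pretrained(name).eval()

@torch.no_grad()
def target_token_logprobs(src_text: str, tgt_ids: torch.Tensor) -> torch.Tensor:
    """Log-probability of each target token given a source (prefix), with
    teacher forcing on the preceding target tokens."""
    enc = tok(src_text, return_tensors="pt")
    logits = mt(**enc, labels=tgt_ids).logits.log_softmax(-1)
    # With `labels`, the decoder inputs are shifted right internally, so the
    # logits at step t score the t-th label token.
    return logits[0, torch.arange(tgt_ids.shape[1]), tgt_ids[0]]

src_words = "Bonjour à tous , comment allez-vous ?".split()
tgt_ids = tok(text_target="Hello everyone , how are you ?",
              return_tensors="pt").input_ids
tau = 1.0  # hypothetical slack (in nats) w.r.t. the full-context log-prob

# Score the target under every source prefix, then pick, for each target
# token, the earliest prefix whose score is within tau of the full context.
prefix_lps = [target_token_logprobs(" ".join(src_words[:k]), tgt_ids)
              for k in range(1, len(src_words) + 1)]
full_lp = prefix_lps[-1]
delays, k_min = [], 1
for t in range(tgt_ids.shape[1]):
    k = k_min
    while k < len(src_words) and prefix_lps[k - 1][t] < full_lp[t] - tau:
        k += 1
    delays.append(k)
    k_min = k  # keep the alignment monotone: a token never precedes earlier ones

print(list(zip(tok.convert_ids_to_tokens(tgt_ids[0].tolist()), delays)))
```

In the paper, the per-word delays recovered this way are used to temporally realign synthetic target speech with the source, yielding training data in which the target lags the source by just enough context to translate correctly.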
Lay Summary: Most speech translation systems today work after a person has finished speaking, which is too slow for real-time conversations. Simultaneous translation --where the system starts translating while the speaker is still talking-- is much harder. It requires smart, split-second decisions about when to translate, how much to wait, and how to keep the translated voice natural and expressive. Until now, machines have struggled to match the performance of human interpreters in this setting. We created Hibiki, a powerful yet simple system that can simultaneously listen and speak. It learns to balance waiting and translating in real time and generates both written and spoken translations. We also developed techniques to train it using synthetic data that sounds natural and stays aligned with the original speaker’s voice and rhythm. Hibiki outperforms past systems in accuracy, speaker similarity, and naturalness, and is the first model to come close to professional human interpretation. It makes real-time, human-like translation more accessible as it can even run on a smartphone. We’re sharing our code, models, and a large dataset to help others build on this progress and bring high-fidelity cross-language communication to more people.
Link To Code: https://github.com/kyutai-labs/hibiki
Primary Area: Applications->Language, Speech and Dialog
Keywords: audio language models, speech translation, multimodal language models, speech-to-speech
Submission Number: 11392