PolyVoice: Language Models for Speech to Speech Translation

Qian qian Dong; Zhiying Huang; Qiao Tian; Chen Xu; Tom Ko; yunlong zhao; Siyuan Feng; Tang Li; Kexin Wang; Xuxin Cheng; Fengpeng Yue; Ye Bai; Xi Chen; Lu Lu; Zejun MA; Yuping Wang; Mingxuan Wang; Yuxuan Wang

Published: 16 Jan 2024, Last Modified: 13 Apr 2024, ICLR 2024 poster

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Speech-to-Speech Translation, Audio Language Model, Voice Clone

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems.

Abstract: With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks. Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality. Speech samples are available at https://polyvoice.github.io.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: general machine learning (i.e., none of the above)

Submission Number: 7518
