Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer

Yongxin Zhu; Dan Su; Liqiang He; Linli Xu; Dong Yu

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: generative models

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: speech language model, speech generation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: Although speech language modeling has advanced considerably, existing models still struggle to model the long acoustic token sequences produced by neural audio codecs. Previous speech language models learn acoustic tokens through a multi-stage generation process, which degrades performance through error propagation and information loss. In this paper, we introduce the Generative Pre-Trained Speech Language Model (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing high-resolution audio generation capabilities. By training on large corpora of raw audio waveforms in an end-to-end unsupervised manner, GPST can unconditionally generate syntactically consistent speech with diverse speaker identities. When provided with a brief 3-second prompt, GPST produces natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms existing speech language models in terms of word error rate, speech quality, and speaker similarity.
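
To give a rough sense of why long codec sequences are costly and why a hierarchical layout can help, the sketch below compares pairwise attention counts for a flat token sequence against a two-level split. It is a back-of-the-envelope illustration only, not the GPST architecture: the abstract does not specify how the hierarchy is arranged, so the frame-level and codebook-level split, the function names, and the figures of 75 frames per second and 8 codebooks are all assumptions made for this example. Semantic tokens are ignored.

```python
def flattened_attention_cost(num_frames: int, num_codebooks: int) -> int:
    """Pairwise attention count when all codec tokens form one flat sequence."""
    seq_len = num_frames * num_codebooks
    return seq_len * seq_len


def hierarchical_attention_cost(num_frames: int, num_codebooks: int) -> int:
    """Pairwise attention count for a hypothetical two-level split.

    Assumed layout (not taken from the source): one attention pass across
    frames, plus one short attention pass across the codebooks of each frame.
    """
    global_cost = num_frames * num_frames
    local_cost = num_frames * num_codebooks * num_codebooks
    return global_cost + local_cost


if __name__ == "__main__":
    # Illustrative values only; none of these come from the abstract.
    frames_per_second = 75
    num_codebooks = 8
    seconds = 10

    num_frames = frames_per_second * seconds
    flat = flattened_attention_cost(num_frames, num_codebooks)
    hier = hierarchical_attention_cost(num_frames, num_codebooks)
    print(f"flat sequence length: {num_frames * num_codebooks}")
    print(f"flat attention pairs: {flat}")
    print(f"hierarchical attention pairs: {hier}")
    print(f"ratio: {flat / hier:.1f}x")
```

Under these assumed values the flat layout grows quadratically in the product of frames and codebooks, while the two-level layout grows quadratically in the number of frames only, which is the kind of saving a hierarchical design aims for.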

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 7044
