Generative Spoken Language Model based on continuous word-sized audio tokens

Robin Jonathan Algayres; Yossi Adi; Tu Anh Nguyen; Jade Copet; Gabriel Synnaeve; Benoît Sagot; Emmanuel Dupoux

22 Sept 2022 (modified: 15 Jan 2026)

ICLR 2023 Conference Withdrawn Submission

Readers: Everyone

Keywords: spoken language model, sentence generation, speech synthesis, k nearest neighbors, unsupervised learning, textless technology

TL;DR: We introduce a generative spoken language model based on continuous word-sized acoustic tokens.

Abstract: In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs is a sequence of 20 ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete-unit GSLMs in terms of generation quality and zero resource challenge metrics. Moreover, it is five times more memory efficient because of its larger units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
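The substitution of k-NN sampling for multinomial sampling can be illustrated with a small sketch. With continuous tokens there is no finite vocabulary to draw from, so one plausible reading is that the model predicts a vector and the next token is drawn from the stored embeddings closest to it. The abstract does not specify the details, so everything below is an assumption made for illustration: the function names `cosine` and `knn_sample`, the use of cosine similarity, the softmax weighting with a `temperature`, and the toy two-dimensional `bank`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_sample(predicted, bank, k=3, temperature=1.0, rng=random):
    """Return the index of one bank embedding drawn from the k nearest to `predicted`.

    Hypothetical illustration: the similarity measure and the softmax
    weighting are assumptions, not details taken from the paper.
    `temperature` must be strictly positive.
    """
    if not 1 <= k <= len(bank):
        raise ValueError("k must be between 1 and len(bank)")
    sims = [cosine(predicted, emb) for emb in bank]
    # Indices of the k most similar stored embeddings.
    nearest = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the neighbours only (shifted by the max for stability).
    top = max(sims[i] for i in nearest)
    weights = [math.exp((sims[i] - top) / temperature) for i in nearest]
    return rng.choices(nearest, weights=weights, k=1)[0]


# Toy bank of stored token embeddings (assumed values).
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
predicted = [0.95, 0.02]  # stand-in for the vector a model would predict

rng = random.Random(0)
draws = [knn_sample(predicted, bank, k=2, rng=rng) for _ in range(20)]
```

With `k=2`, every draw falls on one of the two stored embeddings closest to the predicted vector, while `k=1` reduces to a deterministic nearest-neighbour lookup. Larger `k` or a higher `temperature` would trade fidelity to the prediction for more diverse output.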

Please Choose The Closest Area That Your Submission Falls Into: Generative models

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/generative-spoken-language-model-based-on/code)
