SyllableLM: Learning Coarse Semantic Units for Speech Language Models

14 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: Generative Spoken Language Modeling, Audio, Textless NLP, Representation Learning
TL;DR: We introduce a new self-supervised method to learn semantic speech units (pseudo-syllables) that dramatically lowers bitrate and improves spoken language modeling compared to prior work.
Abstract: Self-Supervised Transformer Models are the backbone of much of the recent progress in deep learning. However, these models require their inputs to be tokenized, and tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed-size convolutions or discrete clustering. For speech and audio models in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge, as several times more tokens are used per word than in textual language modeling. In this work, we introduce a controllable, fully self-supervised technique to dynamically merge speech representations across time to as low as 5 Hz at 60 bits per second while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations between mask spans and model losses and 2) iteratively improving these representations with a novel agglomeration technique. Using these new feature representations, we successfully train SyllableLM, a Neural Codec Language Model (NCLM) competitive with current SoTA NCLMs on a range of common benchmarks with a 30x reduction in pretraining compute, 5x reduction in inference compute, and 2.5x reduction in bitrate.
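Note on the quoted rates: at 5 units per second and 60 bits per second, each unit carries about 12 bits, which corresponds to a codebook of roughly 2^12 = 4096 discrete units (assuming bitrate ≈ unit rate × log2(vocabulary size)). The sketch below is a minimal, hypothetical illustration of merging frame-level speech features across time down to a coarse unit rate; it assumes 50 Hz input features and uses cosine-similarity agglomeration of adjacent frames, which is not the paper's actual method (the paper derives boundaries from mask-span/loss correlations and refines them with its own agglomeration technique).

```python
import numpy as np

def merge_frames_to_rate(feats: np.ndarray, frame_rate_hz: float = 50.0,
                         target_rate_hz: float = 5.0) -> np.ndarray:
    """Greedily merge temporally adjacent frame features until the sequence
    reaches roughly `target_rate_hz` units per second, then mean-pool each
    merged segment. Illustrative only: a stand-in for the paper's
    boundary extraction and agglomeration procedure."""
    n_frames = feats.shape[0]
    duration_s = n_frames / frame_rate_hz
    target_units = max(1, int(round(duration_s * target_rate_hz)))

    # Start with one segment per frame: list of (start, end) index ranges.
    segments = [(i, i + 1) for i in range(n_frames)]

    def segment_mean(seg):
        s, e = seg
        return feats[s:e].mean(axis=0)

    while len(segments) > target_units:
        # Cosine similarity between means of each pair of adjacent segments.
        means = np.stack([segment_mean(s) for s in segments])
        unit = means / (np.linalg.norm(means, axis=1, keepdims=True) + 1e-8)
        sims = (unit[:-1] * unit[1:]).sum(axis=1)
        # Merge the most similar adjacent pair into one segment.
        i = int(np.argmax(sims))
        segments[i:i + 2] = [(segments[i][0], segments[i + 1][1])]

    return np.stack([segment_mean(s) for s in segments])

# Example: 3 seconds of 50 Hz, 768-dim features -> ~15 merged units (~5 Hz).
rng = np.random.default_rng(0)
coarse = merge_frames_to_rate(rng.standard_normal((150, 768)))
print(coarse.shape)  # (15, 768)
```

In a full pipeline, each merged segment would then be quantized against a unit codebook before language modeling; the sketch stops at the continuous pooled representations.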
Primary Area: Speech and audio
Submission Number: 12909