ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without exploiting context information across frames. Unlike these methods, we introduce a novel query-based compression strategy that captures holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal into a shorter token sequence. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) a masked autoencoder (MAE) loss, (2) vector quantization based on semantic priors, and (3) an autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks. (http://dongchaoyang.top/ALMTokenizer/)
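The query-based compression strategy described above can be illustrated with a minimal sketch: a small set of learnable query tokens cross-attends over a window of frame features, producing a token sequence shorter than the frame sequence. This is a hypothetical illustration in PyTorch, not the paper's actual architecture; the module name, dimensions, and number of queries are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Hypothetical sketch of query-based compression: learnable query
    tokens attend over all frame features in a window, summarizing the
    window into fewer tokens (num_queries < n_frames)."""

    def __init__(self, dim=256, num_queries=4, num_heads=4):
        super().__init__()
        # Learnable queries shared across all inputs (an assumed design).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames):
        # frames: (batch, n_frames, dim) -- per-frame encoder features
        b = frames.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries (as attention queries) pool context from every frame.
        out, _ = self.attn(q, frames, frames)
        return out  # (batch, num_queries, dim): a shorter token sequence

comp = QueryCompressor()
x = torch.randn(2, 50, 256)  # 50 frames of hypothetical audio features
z = comp(x)
print(z.shape)  # torch.Size([2, 4, 256])
```

Because the number of output tokens is fixed by the queries rather than by the frame count, the same mechanism yields a lower bitrate: 50 frames are summarized into 4 context-aware tokens, which would then be quantized in a full codec.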
Lay Summary: In recent years, we've seen remarkable advances in artificial intelligence models that understand and generate text. But what if these same models could also understand and generate audio, such as music, speech, and everyday sounds? To make this possible, audio must first be transformed into a format that language models can work with: small, discrete units called tokens, similar to words in a sentence. This process is called audio tokenization. Our work introduces ALMTokenizer, a new tool that turns audio into compact, information-rich tokens. It is designed to help audio language models (similar to ChatGPT, but for audio) perform better across tasks like speech recognition, text-to-speech synthesis, sound-effect generation, and music captioning. Unlike previous methods that compress audio frame by frame, ALMTokenizer takes a smarter approach: it looks at larger chunks of audio and learns to summarize the most important information using a small number of trainable queries. This not only saves storage space (low bitrate) but also preserves meaning and context (semantic richness). We also design new ways to teach our model to focus on meaningful information in audio, including training it to predict missing parts of the audio and optimizing it to better represent the sound structure that language models rely on. As a result, ALMTokenizer delivers better audio quality at a lower bitrate and helps AI models understand and generate sound more accurately. In short, ALMTokenizer moves us closer to general-purpose AI that can handle audio as well as it handles text: efficiently, meaningfully, and with high quality.
Primary Area: Applications->Language, Speech and Dialog
Keywords: Audio Language Models, Audio Codec, Audio Tokenizer, Audio Understanding and Generation
Submission Number: 1066