Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: We replace discrete tokens with continuous latents for audio language modeling, enhanced by a novel masked next-token prediction task.
Abstract: Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) that operates without discrete tokens. We leverage token-wise diffusion to model the distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete-token solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, this innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters: 193M for our Base and 462M for our Large model.
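To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch (not the authors' released code) of a causal Transformer over continuous-valued audio latents whose next-token distribution is modeled by a small per-token diffusion head, together with a simple masked-input variant of next-token prediction. The latent dimension, module sizes, mask ratio, and noise schedule are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch, assuming continuous latents from a pretrained audio autoencoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffusionHead(nn.Module):
    """Regresses the noise added to the next continuous token, conditioned on the LM state."""

    def __init__(self, token_dim, cond_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, noisy_token, cond, t):
        # t in [0, 1] is appended as an extra scalar feature per position.
        return self.net(torch.cat([noisy_token, cond, t], dim=-1))


class ContinuousCausalLM(nn.Module):
    def __init__(self, token_dim=8, d_model=256, n_layers=4, n_heads=4, mask_ratio=0.3):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, d_model)
        self.mask_emb = nn.Parameter(torch.zeros(d_model))  # learned embedding for masked inputs
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = DiffusionHead(token_dim, d_model)
        self.mask_ratio = mask_ratio

    def forward(self, tokens):
        # tokens: (B, T, token_dim) continuous latents from an audio autoencoder (assumed).
        B, T, _ = tokens.shape
        x = self.in_proj(tokens)

        # Masked next-token prediction: randomly replace a fraction of input positions
        # with the mask embedding; the model must still predict every next token.
        keep = (torch.rand(B, T, 1, device=tokens.device) > self.mask_ratio).float()
        x = keep * x + (1.0 - keep) * self.mask_emb

        # Standard causal attention mask for next-token prediction.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        cond, target = h[:, :-1], tokens[:, 1:]  # state at position i conditions token i+1

        # Token-wise diffusion loss: noise the target token and regress the noise.
        t = torch.rand(B, T - 1, 1, device=tokens.device)       # uniform timesteps (assumption)
        alpha, sigma = torch.cos(0.5 * torch.pi * t), torch.sin(0.5 * torch.pi * t)
        eps = torch.randn_like(target)
        eps_hat = self.head(alpha * target + sigma * eps, cond, t)
        return F.mse_loss(eps_hat, eps)


if __name__ == "__main__":
    model = ContinuousCausalLM()
    latents = torch.randn(2, 50, 8)  # e.g. 2 clips x 50 frames of 8-dim latents (made up)
    loss = model(latents)
    loss.backward()
    print(float(loss))
```

At inference time, one would sample each next latent by running the diffusion head's reverse process conditioned on the current LM state, append it to the sequence, and repeat; that generation loop is omitted here for brevity.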
Lay Summary: Have you ever wondered how computers can generate realistic sounds, like music or environmental noises? Our research pushes this ability forward by teaching AI to create audio in a smoother and more natural way—more like how humans experience sound. Traditionally, powerful AI systems like ChatGPT learn to predict the next word when writing sentences. We use a similar method, but instead of words, we teach the AI to generate the next tiny slice of sound. This is especially tricky because sound isn’t made of clear-cut pieces like words—it’s continuous and complex. To tackle this, we designed a new method that helps the AI better understand and produce these continuous sound waves. As a result, our system creates audio that sounds much more natural than older methods, especially compared to a popular model called AudioGen. It’s also more efficient: our models are smaller and faster, but still match or even outperform state-of-the-art systems in quality. In short, we’ve taken a big step toward helping AI generate high-quality audio in a smarter and more efficient way—bringing us closer to lifelike sound generation for games, movies, accessibility tools, and beyond.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: text-to-audio, audio language modeling, continuous token, diffusion model, next-token prediction
Submission Number: 7312