Abstract: Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the inherently long sequence length of audio, the efficiency of audio generation remains an essential issue to address, especially for AR models incorporated into large language models (LLMs). In this paper, we analyze the information flow and propose a novel Scale-level Audio Tokenizer (SAT) with improved multi-scale residual quantization. Based on SAT, we further propose a scale-level Acoustic AutoRegressive (AAR) modeling framework, which shifts next-token AR prediction to next-scale AR prediction, significantly reducing inference time and training cost. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate that the proposed AAR framework achieves a remarkable 35x inference speedup and a +1.331 FAD improvement over baselines on the AudioSet benchmark. Code and pre-trained checkpoints will be released to facilitate audio generation research.
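To make the next-scale idea concrete, below is a minimal sketch of multi-scale residual quantization over a 1-D audio latent: each scale quantizes the residual left by all coarser scales, so the token sequence grows scale by scale rather than step by step. The function name, the shared-codebook design, the linear resampling, and the scale schedule are all illustrative assumptions, not the paper's actual SAT implementation.

```python
import torch
import torch.nn.functional as F

def multi_scale_residual_quantize(z, codebook, scales=(1, 2, 4, 8, 16)):
    """Illustrative multi-scale residual quantization of an audio latent.

    z:        (B, C, T) continuous latent from an audio encoder.
    codebook: (V, C) shared codebook, nearest-neighbour lookup per scale.
    scales:   token counts per scale, coarse to fine (hypothetical values).
    Returns per-scale token indices and the cumulative reconstruction.
    """
    B, C, T = z.shape
    residual = z
    recon = torch.zeros_like(z)
    tokens = []
    for s in scales:
        # Downsample the current residual to s time steps (this scale).
        r_s = F.interpolate(residual, size=s, mode="linear", align_corners=False)
        # Nearest-neighbour codebook lookup: (B, s) token indices.
        flat = r_s.permute(0, 2, 1).reshape(-1, C)            # (B*s, C)
        idx = torch.cdist(flat, codebook).argmin(dim=1)       # (B*s,)
        tokens.append(idx.view(B, s))
        q_s = codebook[idx].view(B, s, C).permute(0, 2, 1)    # (B, C, s)
        # Upsample the quantized scale to full length; refine the residual.
        recon = recon + F.interpolate(q_s, size=T, mode="linear",
                                      align_corners=False)
        residual = z - recon
    return tokens, recon
```

Under this scheme, a next-scale AR model conditions on all tokens of scales 1..k to predict the tokens of scale k+1 in parallel, which is the source of the inference savings the abstract reports: the number of AR steps equals the number of scales, not the total token count.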
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Autoregressive, Audio Generation, Tokenizer
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1277