Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: We introduce ResGen, an efficient Residual Vector Quantization (RVQ)-based generative model for high-fidelity generation with fast sampling. RVQ improves data fidelity by increasing the number of quantization steps, referred to as depth, but deeper quantization typically increases the number of inference steps in generative models. To address this, ResGen directly predicts the vector embedding of collective tokens rather than individual ones, ensuring that the number of inference steps remains independent of RVQ depth. Additionally, we formulate token masking and multi-token prediction within a probabilistic framework using discrete diffusion and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
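For intuition, the following is a minimal NumPy sketch of the mechanism the abstract describes: RVQ encodes a vector into D tokens by repeatedly quantizing the remaining residual, and the embedding of the "collective token" is simply the sum of the selected code vectors across depths. This is an illustrative assumption of the setup, not the authors' implementation; all names, shapes, and codebooks here are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

D = 4    # RVQ depth (number of quantization stages); illustrative value
K = 256  # codebook size per stage; illustrative value
dim = 8  # embedding dimension; illustrative value

# One randomly initialized codebook per depth (for illustration only).
codebooks = rng.normal(size=(D, K, dim)).astype(np.float32)

def rvq_encode(x, codebooks):
    """Residual vector quantization: at each depth, quantize the
    remaining residual against that depth's codebook."""
    residual = x
    tokens = []
    for cb in codebooks:
        # Nearest code vector to the current residual.
        idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(-1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def collective_embedding(tokens, codebooks):
    """Embedding of the collective token: the sum of the selected
    code vectors across all depths, i.e. the full RVQ reconstruction."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

x = rng.normal(size=dim).astype(np.float32)
tokens = rvq_encode(x, codebooks)          # D discrete tokens
x_hat = collective_embedding(tokens, codebooks)
print(tokens, np.linalg.norm(x - x_hat))  # residual error shrinks as D grows

Increasing D drives the reconstruction error down (higher fidelity), yet the summed embedding stays a single dim-sized vector; this is why predicting the collective embedding directly, rather than the D individual tokens, keeps the inference cost independent of depth.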
Lay Summary: Scaling data and model size makes generative AI outputs more realistic, but it also stretches generation time and energy budgets. A common remedy is to encode raw data into compact codes; Residual Vector Quantization (RVQ) stands out for packing detail into short code sequences. Unfortunately, sampling in existing RVQ-based generators still slows down as the quantization depth increases. We introduce ResGen, an efficient RVQ-based generative modeling framework that directly predicts the vector embedding of collective tokens rather than individual ones, ensuring that the number of inference steps remains independent of RVQ depth. Extensive ablation studies confirm that our modeling is especially well-matched to RVQ tokens. On ImageNet 256×256 image generation and zero-shot text-to-speech, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared with similarly sized baseline models. This efficiency slashes GPU hours, lowers carbon cost, and opens the door to real-time, on-device generative applications.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: generative models, discrete diffusion, residual vector quantization, transformer, text-to-speech
Submission Number: 1337