Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

TMLR Paper7513 Authors

14 Feb 2026 (modified: 06 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Discrete tokenizers are often suspected of inherently limiting sample diversity in token-based generative models. We show that this diversity gap is not caused by discretization itself but by the timing of quantization. We systematically identify quantization applied at the initial stage of training as the primary catalyst for a representational misalignment in which the codebook prematurely shrinks onto a narrow latent manifold. This early shrinkage prevents the codebook from covering the encoder's diverse embedding space. Although it may yield deceptively strong reconstructions, it creates a bottleneck that forces the generator to rely on a homogenized set of tokens; because the codebook never anchors to robust representations, generative diversity suffers downstream. To address this, we propose Deferred Quantization, a simple yet effective strategy that prepends a continuous, quantization-free learning phase. By letting the encoder first establish a well-distributed representation space before discretization is introduced, the codebook can anchor to a mature and diverse latent landscape. Across tokenizers and token-based generators, Deferred Quantization consistently reduces shrinkage, improves generative diversity, and preserves reconstruction quality and compression. We additionally provide a shrinkage diagnostic suite and offer practical guidance for designing diversity-preserving discrete tokenizers.
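The abstract does not spell out the training schedule, but the idea admits a compact reading for a VQ-VAE-style tokenizer: run a purely continuous autoencoding phase first, then switch on vector quantization. The sketch below illustrates that reading under stated assumptions; warmup_steps, the straight-through quantizer, and the commitment weight beta are illustrative choices, not the paper's confirmed method. The codebook_usage function at the end is a standard VQ utilization/perplexity measure that could serve as one of the shrinkage indicators the abstract mentions.

    import torch
    import torch.nn.functional as F

    def quantize(z_e, codebook):
        """Nearest-neighbor lookup with a straight-through gradient estimator."""
        # z_e: (B, D) encoder outputs; codebook: (K, D) learnable code vectors.
        dists = torch.cdist(z_e, codebook)    # (B, K) pairwise distances
        indices = dists.argmin(dim=1)         # hard code assignment
        z_q = codebook[indices]               # (B, D) quantized latents
        z_q = z_e + (z_q - z_e).detach()      # straight-through: grads flow to encoder
        return z_q, indices

    def training_step(encoder, decoder, codebook, x, step, warmup_steps, beta=0.25):
        z_e = encoder(x)
        if step < warmup_steps:
            # Phase 1 (deferred quantization, hypothetical schedule): plain
            # continuous autoencoding, so the encoder's representation space
            # spreads out before any code is committed.
            return F.mse_loss(decoder(z_e), x)
        # Phase 2: standard VQ objective once the latent space has matured.
        z_q, indices = quantize(z_e, codebook)
        recon = F.mse_loss(decoder(z_q), x)
        codebook_loss = F.mse_loss(codebook[indices], z_e.detach())  # pull codes to latents
        commit_loss = F.mse_loss(z_e, codebook[indices].detach())    # keep encoder near codes
        return recon + codebook_loss + beta * commit_loss

    @torch.no_grad()
    def codebook_usage(indices, num_codes):
        """Active-code fraction and usage perplexity; low values signal shrinkage."""
        counts = torch.bincount(indices, minlength=num_codes).float()
        probs = counts / counts.sum()
        perplexity = torch.exp(-(probs * (probs + 1e-10).log()).sum())
        return (counts > 0).float().mean().item(), perplexity.item()

At the phase switch, one might additionally re-initialize the codebook from encoder outputs (e.g., k-means over a batch of latents) so that codes start anchored to the matured manifold; whether the paper uses such a re-initialization is not stated here.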
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hao_Tang1
Submission Number: 7513