Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

TMLR Paper7513 Authors

14 Feb 2026 (modified: 06 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Discrete tokenizers are often suspected of inherently limiting sample diversity in token-based generative models. We show that this diversity gap is not caused by discretization itself but by the timing of quantization. We systematically identify quantization applied at the initial stage of training as the primary catalyst for a representational misalignment in which the codebook prematurely shrinks onto a narrow latent manifold. This early shrinkage prevents the codebook from covering the encoder's diverse embedding space. Although it may yield deceptively strong reconstructions, it creates a bottleneck that forces the generator to rely on a homogenized set of tokens; because the codebook never anchors to robust representations, generative diversity suffers downstream. To address this, we propose Deferred Quantization, a simple yet effective strategy that prepends a continuous, quantization-free learning phase. By letting the encoder first establish a well-distributed representation space before discretization is introduced, the codebook can anchor to a mature and diverse latent landscape. Across tokenizers and token-based generators, Deferred Quantization consistently reduces shrinkage, improves generative diversity, and preserves reconstruction quality and compression. We additionally provide a shrinkage diagnostic suite and offer practical guidance for designing diversity-preserving discrete tokenizers.
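The abstract does not spell out the training schedule, but the idea admits a compact reading for a VQ-VAE-style tokenizer: run a purely continuous autoencoding phase first, then switch on vector quantization. The sketch below illustrates that reading under stated assumptions; warmup_steps, the straight-through quantizer, and the commitment weight beta are illustrative choices, not the paper's confirmed method. The codebook_usage function at the end is a standard VQ utilization/perplexity measure that could serve as one of the shrinkage indicators the abstract mentions.

    import torch
    import torch.nn.functional as F

    def quantize(z_e, codebook):
        """Nearest-neighbor lookup with a straight-through gradient estimator."""
        # z_e: (B, D) encoder outputs; codebook: (K, D) learnable code vectors.
        dists = torch.cdist(z_e, codebook)    # (B, K) pairwise distances
        indices = dists.argmin(dim=1)         # hard code assignment
        z_q = codebook[indices]               # (B, D) quantized latents
        z_q = z_e + (z_q - z_e).detach()      # straight-through: grads flow to encoder
        return z_q, indices

    def training_step(encoder, decoder, codebook, x, step, warmup_steps, beta=0.25):
        z_e = encoder(x)
        if step < warmup_steps:
            # Phase 1 (deferred quantization, hypothetical schedule): plain
            # continuous autoencoding, so the encoder's representation space
            # spreads out before any code is committed.
            return F.mse_loss(decoder(z_e), x)
        # Phase 2: standard VQ objective once the latent space has matured.
        z_q, indices = quantize(z_e, codebook)
        recon = F.mse_loss(decoder(z_q), x)
        codebook_loss = F.mse_loss(codebook[indices], z_e.detach())  # pull codes to latents
        commit_loss = F.mse_loss(z_e, codebook[indices].detach())    # keep encoder near codes
        return recon + codebook_loss + beta * commit_loss

    @torch.no_grad()
    def codebook_usage(indices, num_codes):
        """Active-code fraction and usage perplexity; low values signal shrinkage."""
        counts = torch.bincount(indices, minlength=num_codes).float()
        probs = counts / counts.sum()
        perplexity = torch.exp(-(probs * (probs + 1e-10).log()).sum())
        return (counts > 0).float().mean().item(), perplexity.item()

At the phase switch, one might additionally re-initialize the codebook from encoder outputs (e.g., k-means over a batch of latents) so that codes start anchored to the matured manifold; whether the paper uses such a re-initialization is not stated here.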
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hao_Tang1
Submission Number: 7513