LatentBit: Discrete Visual Tokenization with Preserved Continuous Structure

ICLR 2026 Conference Submission 20515 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Generative models, Tokenizers, Image tokenization, Video tokenization, Masked language models
Abstract: Image and video autoregressive generative models are limited by their reliance on language-modeling frameworks, and converging evidence points to their discrete representations as the bottleneck. Recent work addresses this bottleneck by constraining representational capacity via latent binarization and by scaling codebooks, yielding measurable generation gains. However, binarizing the codes destroys metric structure, coarsens the latent manifold, and degrades reconstruction under the same token budget. We propose an efficient tokenizer that induces a latent manifold on par with continuous representations, without additional GAN refinements or iterative sampling strategies. Our tokenizer learns a discrete vocabulary aligned to a frozen continuous latent geometry, preserving metric structure and delivering competitive reconstruction quality with a scalable codebook. While naively scaling the codebook increases compute and memory demands, we overcome this limitation by decomposing tokens into bits. On top of this tokenizer, we train a masked-language-model (MLM) generator with bit-wise prediction, and find that the bit-wise strategy yields better likelihood and faster convergence than alternative subgrouping schemes. This work substantially narrows the performance gap between discrete and continuous representations, bringing discrete approaches close to parity with continuous variants in both reconstruction and generation quality.
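The abstract does not give implementation details, but the core idea of decomposing tokens into bits can be illustrated with a minimal sketch: a codebook of size 2^K lets each token index be written as K binary variables, so a generator can predict K bits per position rather than a single 2^K-way categorical. The names and the codebook size below are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical bit-wise token decomposition for a codebook of size 2**K.
K = 16  # assumed bits per token -> effective codebook size 65,536

def token_to_bits(indices: np.ndarray) -> np.ndarray:
    """Decompose integer token indices in [0, 2**K) into K bits each."""
    shifts = np.arange(K)
    # Shape (..., K); least-significant bit first.
    return (indices[..., None] >> shifts) & 1

def bits_to_token(bits: np.ndarray) -> np.ndarray:
    """Recompose K bits (e.g. per-bit MLM predictions) into token indices."""
    shifts = np.arange(K)
    return (bits * (1 << shifts)).sum(axis=-1)

tokens = np.array([0, 1, 42_000, 65_535])
assert np.array_equal(bits_to_token(token_to_bits(tokens)), tokens)
```

One reading of the compute argument in the abstract: predicting K independent bits needs an output head of size K, whereas a flat softmax over the full codebook needs 2^K logits, which is why naive codebook scaling inflates compute and memory.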
Primary Area: generative models
Submission Number: 20515