MsTok: Query-Based Multi-Scale 1D Visual Tokenization

15 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Image Generation, 1D Tokenizer, Hierarchical Representation, Decoupled Querying
Abstract: One-dimensional (1D) image tokenizers, such as TiTok, have achieved remarkable breakthroughs in efficient image generation by encoding images into extremely compact sequences of discrete tokens. However, we identify two critical, inherent limitations of this architecture. First, TiTok's single-scale encoding strategy restricts the latent code's capacity to capture both the macroscopic structure and the microscopic details of an image simultaneously, creating a representation bottleneck. Second, TiTok's design, which concatenates latent tokens with image patch sequences as a unified input to the encoder, incurs a quadratic increase in computational complexity as the number of latent tokens grows, creating an efficiency bottleneck. To address both issues concurrently, we propose MsTok, a novel, multi-scale-aware, and computationally efficient 1D image tokenizer. Our approach introduces two core innovations: 1) We construct a hierarchical multi-scale memory by aggregating selected intermediate ViT layers with scale embeddings. 2) We decouple the latent tokens from the backbone encoder and decoder, reformulating them as a set of "query tokens" that interact with the multi-scale memory through a separate and efficient cross-attention module after the image encoding is complete. This decoupled design reduces the computational cost of increasing the number of latent tokens from quadratic to linear. Experiments show that MsTok not only significantly improves image reconstruction quality but also scales far more favorably with the number of latent tokens, paving the way for more powerful and efficient generative models.
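The abstract's two core ideas, a multi-scale memory built from intermediate ViT layers with scale embeddings, and decoupled query tokens that read from that memory via cross-attention, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation; the module name `MsTokQueryHead`, the choice of three scales, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of the decoupled query-token
# design: selected ViT layer outputs are tagged with scale embeddings to form a
# multi-scale memory, and learnable 1D query tokens attend to it once, outside
# the backbone's self-attention.
import torch
import torch.nn as nn


class MsTokQueryHead(nn.Module):
    """Aggregates selected ViT layers into a multi-scale memory and lets a
    fixed set of latent query tokens read from it via cross-attention."""

    def __init__(self, dim=768, num_queries=32, num_scales=3, num_heads=8):
        super().__init__()
        # Learnable 1D latent tokens, decoupled from the backbone encoder.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        # One scale embedding per selected intermediate layer.
        self.scale_embed = nn.Parameter(torch.randn(num_scales, 1, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, intermediate_feats):
        # intermediate_feats: list of [B, N, dim] tensors from selected ViT layers.
        B = intermediate_feats[0].shape[0]
        # Build the hierarchical memory: tag each layer's patch features with
        # its scale embedding, then concatenate along the sequence dimension.
        memory = torch.cat(
            [f + self.scale_embed[i] for i, f in enumerate(intermediate_feats)],
            dim=1,
        )  # [B, num_scales * N, dim]
        kv = self.norm_kv(memory)
        q = self.norm_q(self.queries.expand(B, -1, -1))
        # Cross-attention: cost grows linearly with the number of query tokens,
        # since the queries never enter the backbone's self-attention.
        latents, _ = self.cross_attn(q, kv, kv)
        return latents  # [B, num_queries, dim], ready for quantization


# Example usage with dummy features from three hypothetical ViT layers.
feats = [torch.randn(2, 196, 768) for _ in range(3)]
tokens = MsTokQueryHead()(feats)
print(tokens.shape)  # torch.Size([2, 32, 768])
```

Because the queries only appear in this cross-attention step, adding more latent tokens lengthens the query sequence but leaves the backbone's quadratic self-attention cost untouched, which is the linear-scaling argument the abstract makes.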
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5423