Keywords: Visual Tokenizer, Visual Generation, Multimodal
TL;DR: We identify a reconstruction-generation discrepancy in semantically structured visual tokenizers and resolve it through tokenizer-generator co-design.
Abstract: Discrete visual tokenization is a cornerstone of modern auto-regressive (AR) image generation, yet current methods are fundamentally constrained by a trade-off between reconstruction fidelity and semantic expressivity. In this work, we first propose a principled framework for token representation learning based on three pillars: feature alignment with foundation models, structural diversification of the codebook into specialized subspaces, and explicit disentanglement to enforce semantic independence. We instantiate these principles in a novel tokenizer, Semantic Subspace Quantization (SSQ), which achieves state-of-the-art image reconstruction. However, this success reveals a critical and previously overlooked paradox: the semantically rich, structured representations that excel at reconstruction cause a significant performance collapse in standard AR generative models. To resolve this Reconstruction-Generation Discrepancy, we introduce a tokenizer-generator co-design methodology, systematically adapting the AR model's architecture and training curriculum to harness the multi-faceted nature of SSQ's tokens. The resulting system effectively alleviates the discrepancy, achieving state-of-the-art performance in both high-fidelity reconstruction and generation and demonstrating a new path forward for discrete visual modeling.
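To make the abstract's "structural diversification of the codebook into specialized subspaces" concrete, here is a minimal sketch of subspace quantization: the continuous latent is split into K subspaces, each quantized against its own codebook. This is not the authors' released code; the class name, shapes, and the straight-through gradient choice are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of quantizing
# a latent vector per-subspace, each subspace with its own codebook.
import torch
import torch.nn as nn

class SubspaceQuantizer(nn.Module):
    def __init__(self, dim=256, num_subspaces=4, codebook_size=1024):
        super().__init__()
        assert dim % num_subspaces == 0
        self.num_subspaces = num_subspaces
        sub_dim = dim // num_subspaces
        # One independent codebook per subspace ("structural diversification").
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, sub_dim))
             for _ in range(num_subspaces)]
        )

    def forward(self, z):  # z: (batch, dim) continuous encoder features
        quantized, indices = [], []
        for chunk, codebook in zip(z.chunk(self.num_subspaces, dim=-1),
                                   self.codebooks):
            # Nearest-neighbor lookup within this subspace's codebook.
            dists = torch.cdist(chunk, codebook)   # (batch, codebook_size)
            idx = dists.argmin(dim=-1)             # (batch,)
            q = codebook[idx]                      # (batch, sub_dim)
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)

# Usage: each token becomes a tuple of K sub-indices, one per subspace,
# rather than a single code.
quantizer = SubspaceQuantizer()
z_q, ids = quantizer(torch.randn(8, 256))
print(z_q.shape, ids.shape)  # torch.Size([8, 256]) torch.Size([8, 4])
```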
Primary Area: generative models
Submission Number: 3005