Semantic-Aware Prefix Learning via Token Truncation for Efficient Image Generation

TMLR Paper7941 Authors

15 Mar 2026 (modified: 27 May 2026). Decision pending for TMLR. License: CC BY 4.0
Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects semantic conditions as prefix-preserved invariants into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility via progressive token truncation. This leads to information-ordered token sequences that support length-adaptive encoding and graceful truncation. To exploit the resulting latent space for generation, we further introduce CARD, a hybrid Causal AutoRegressive--Diffusion generator. CARD first models global structural dependencies autoregressively and then refines the conditional distribution via flow matching for high-fidelity synthesis. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance with compact token budgets.
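The tail token dropping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the truncation length is chosen or how dropped tokens are represented, so the uniform sampling in `sample_keep_len`, the zero-valued mask, and all function and parameter names here are assumptions made for illustration only.

```python
import random


def drop_tail_tokens(tokens, keep_len, mask_value=0.0):
    """Keep the first keep_len tokens and replace the tail with mask_value.

    tokens is a list of equal-length float vectors (a 1D latent sequence).
    Masking with a constant is an assumption; the source does not specify it.
    """
    if not 1 <= keep_len <= len(tokens):
        raise ValueError("keep_len must be in [1, len(tokens)]")
    dim = len(tokens[0])
    kept = [list(t) for t in tokens[:keep_len]]
    masked = [[mask_value] * dim for _ in range(len(tokens) - keep_len)]
    return kept + masked


def sample_keep_len(num_tokens, min_keep, rng):
    """Hypothetical schedule: uniform prefix length in [min_keep, num_tokens]."""
    return rng.randint(min_keep, num_tokens)


rng = random.Random(0)
# Toy latent sequence: 8 tokens of dimension 2, all non-zero.
latents = [[float(i), float(i) + 0.5] for i in range(1, 9)]
keep_len = sample_keep_len(len(latents), min_keep=2, rng=rng)
truncated = drop_tail_tokens(latents, keep_len)
```

Under these assumptions, a decoder trained on `truncated` only ever sees an intact prefix, so the earliest tokens (together with any semantic condition supplied alongside them) must carry the information needed for reconstruction. That is the property that would let the sequence be cut at inference time to fit a smaller token budget.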
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ning_Yu2
Submission Number: 7941