LacTok: Latent Consistency Tokenizer for High-resolution Image Reconstruction and Generation by 256 Tokens
Keywords: Tokenizer, consistency model, image generation, image reconstruction
TL;DR: A novel and efficient tokenizer for high-resolution image reconstruction and generation
Abstract: Image tokenization has significantly advanced visual generation and multimodal
modeling, particularly when paired with autoregressive models. However, current
methods face challenges in balancing efficiency and fidelity: high-resolution image
reconstruction either requires an excessive number of tokens or compromises
critical details through token reduction. To resolve this, we propose the Latent Consistency
Tokenizer (LacTok), which bridges discrete visual tokens with the compact
latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient
representation of 1024×1024 images using only 256 tokens—a 16× compression
over VQGAN. LacTok integrates a transformer encoder, a quantized codebook,
and a latent consistency decoder. Directly applying an LDM as the decoder results in color and
brightness discrepancies; we therefore convert it into a latent consistency decoder, reducing
multi-step sampling to 1-2 steps and enabling direct pixel-level supervision.
Experiments demonstrate LacTok’s superiority in high-fidelity reconstruction, achieving a
reconstruction Fréchet Inception Distance (rFID) of 10.8 on the MSCOCO-2017 5K benchmark
for 1024×1024 image reconstruction. We also extend LacTok to an autoregressive text-to-image
generation model, LacTokGen, which achieves a score of 0.73 on the GenEval benchmark,
surpassing current state-of-the-art methods.
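To make the described pipeline concrete, below is a minimal, self-contained PyTorch sketch, not the paper's implementation. The names LacTokSketch and VectorQuantizer, the codebook size, the toy latent shape, and the mean-pooled conditioning are illustrative assumptions; the actual model pairs a transformer encoder and quantized codebook with a latent consistency decoder that operates in a pre-trained LDM's latent space.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learned codebook with straight-through gradients."""
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                     # z: (B, N, D) continuous token features
        b, n, d = z.shape
        dist = torch.cdist(z.reshape(-1, d), self.codebook.weight)
        idx = dist.argmin(-1).view(b, n)                      # discrete token indices
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx                    # straight-through estimator


class LacTokSketch(nn.Module):
    """Toy pipeline: 1024x1024 image -> 256 discrete tokens -> LDM-style latent via a consistency decoder."""
    def __init__(self, dim=256, latent_shape=(4, 32, 32)):    # toy latent; an SD latent at 1024^2 would be (4, 128, 128)
        super().__init__()
        self.latent_shape = latent_shape
        c, h, w = latent_shape
        # ViT-style encoder: 64x64 patches of a 1024x1024 image give exactly 16x16 = 256 tokens
        self.patchify = nn.Conv2d(3, dim, kernel_size=64, stride=64)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.quantizer = VectorQuantizer(dim=dim)
        # stand-in consistency decoder: predicts a clean latent from (noisy latent, timestep, tokens);
        # tokens are mean-pooled here for brevity, whereas a real decoder would attend to all 256
        self.decoder = nn.Sequential(
            nn.Linear(dim + c * h * w + 1, 2048),
            nn.GELU(),
            nn.Linear(2048, c * h * w),
        )

    def decode(self, tokens, steps=2):
        """1-2 step consistency sampling in the latent space (multi-step diffusion collapsed to few steps)."""
        b = tokens.size(0)
        c, h, w = self.latent_shape
        x = torch.randn(b, c * h * w)                         # start from pure noise
        timesteps = torch.linspace(1.0, 0.0, steps + 1)[:-1]  # e.g. [1.0] or [1.0, 0.5]
        for t in timesteps:
            cond = torch.cat([tokens.mean(1), x, torch.full((b, 1), float(t))], dim=1)
            x = self.decoder(cond)                            # consistency model predicts the clean latent directly
            if t > timesteps[-1]:                             # re-noise between steps, but not after the last one
                x = x + 0.5 * torch.randn_like(x)
        return x.view(b, c, h, w)                             # would be fed to a frozen LDM VAE decoder for pixels

    def forward(self, img):                                   # img: (B, 3, 1024, 1024)
        z = self.patchify(img).flatten(2).transpose(1, 2)     # (B, 256, dim)
        z_q, idx = self.quantizer(self.encoder(z))
        return self.decode(z_q), idx


# Usage: latent, ids = LacTokSketch()(torch.randn(1, 3, 1024, 1024))
```

Pixel-level supervision is possible here because the consistency decoder reaches a clean latent in 1-2 steps, so the reconstruction can be decoded and compared to the input image directly during training.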
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23662