CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Yitong Chen; Zuxuan Wu; Jiaqi Wang; Xipeng Qiu; Yu-Gang Jiang

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Yitong Chen, Zuxuan Wu, Jiaqi Wang, Xipeng Qiu, Yu-Gang Jiang

03 Sept 2025 (modified: 14 Nov 2025)ICLR 2026 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Image tokenization, Reconstruction, MeanFlow

Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 22.72 PSNR and 0.681 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.

Primary Area: generative models

Submission Number: 1613

Loading