Keywords: diffusion models, generative models, AIGC, AR, autoregressive, omnimodal, multimodal
TL;DR: We propose self-token prediction, a paradigm that conditions each token on ground-truth references during training, ensuring consistency with causal inference while avoiding identity collapse.
Abstract:
Next-token prediction has been highly effective in language, but its extension to continuous modalities is challenging: regression over correlated latents tends to collapse into near-identity mappings, while discretization via vector-quantized encoders introduces quantization artifacts. Mask-based prediction with diffusion heads mitigates these issues, yet suffers from a train–inference mismatch, an inability to use key–value caching, and poor scalability to long sequences. To overcome these limitations, we propose \emph{self-token prediction}, which conditions each token on ground-truth references during training, ensuring consistency with causal inference while avoiding identity collapse. This design supports key–value caching and parallel generation, enabling scalable, high-fidelity synthesis across text, audio, image, and video. Built on this paradigm, \textsc{OmniAR} unifies heterogeneous modalities in a shared omni-token space, achieving efficient and high-quality generation, including real-time and theoretically endless video generation.
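The abstract does not spell out the training mechanics, but a rough picture can be sketched. Below is a minimal, hypothetical PyTorch sketch of one self-token-prediction training step, assuming (since the abstract does not specify) that the ground-truth reference at each position is a noise-corrupted copy of that position's continuous token, that attention is causal so training matches left-to-right inference with key–value caching, and that the loss regresses the clean token. All names here (`SelfTokenPredictor`, `training_step`, `noise_std`) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfTokenPredictor(nn.Module):
    """Causal transformer over continuous tokens (illustrative, not the paper's model)."""
    def __init__(self, token_dim=32, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, token_dim)

    def forward(self, noisy_tokens):
        # Causal mask: position t attends only to references at positions <= t,
        # so inference can proceed left to right with key-value caching.
        T = noisy_tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(noisy_tokens.device)
        h = self.encoder(self.in_proj(noisy_tokens), mask=mask)
        return self.out_proj(h)

def training_step(model, clean_tokens, noise_std=0.5):
    # Assumed corruption scheme: additive Gaussian noise on the ground-truth
    # reference, strong enough that a near-identity copy of the input cannot
    # minimize the loss (the collapse mode the abstract warns about).
    noisy = clean_tokens + noise_std * torch.randn_like(clean_tokens)
    pred = model(noisy)                    # every position predicted in parallel
    return F.mse_loss(pred, clean_tokens)  # regress the clean continuous token

model = SelfTokenPredictor()
tokens = torch.randn(8, 16, 32)  # (batch, sequence length, continuous token dim)
loss = training_step(model, tokens)
loss.backward()
```

Under this reading, the corruption is what blocks the identity shortcut: copying the (noisy) input can no longer minimize the regression loss, while the causal mask keeps training consistent with token-by-token decoding.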
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2263