Keywords: diffusion models, generative models, AIGC, AR, autoregressive, omnimodal, multimodal
TL;DR: We propose self-token prediction, a paradigm that conditions each token on ground-truth references during training, ensuring consistency with causal inference while avoiding identity collapse.
Abstract:
Next-token prediction has been highly effective in language, but its extension to continuous modalities is challenging: regression over correlated latents tends to collapse into near-identity mappings, while discretization via vector-quantized encoders introduces quantization artifacts. Mask-based prediction with diffusion heads mitigates these issues, yet suffers from a train–inference mismatch, an inability to use key–value caching, and poor scalability to long sequences. To overcome these limitations, we propose \emph{self-token prediction}, which conditions each token on ground-truth references during training, ensuring consistency with causal inference while avoiding identity collapse. This design supports key–value caching and parallel generation, enabling scalable, high-fidelity synthesis across text, audio, image, and video. Built on this paradigm, \textsc{OmniAR} unifies heterogeneous modalities in a shared omni-token space, achieving efficient and high-quality generation, including real-time and theoretically endless video generation.
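The abstract does not spell out the training mechanics, but a rough picture can be sketched. Below is a minimal, hypothetical PyTorch sketch of one self-token-prediction training step, assuming (since the abstract does not specify) that the ground-truth reference at each position is a noise-corrupted copy of that position's continuous token, that attention is causal so training matches left-to-right inference with key–value caching, and that the loss regresses the clean token. All names here (`SelfTokenPredictor`, `training_step`, `noise_std`) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfTokenPredictor(nn.Module):
    """Causal transformer over continuous tokens (illustrative, not the paper's model)."""
    def __init__(self, token_dim=32, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, token_dim)

    def forward(self, noisy_tokens):
        # Causal mask: position t attends only to references at positions <= t,
        # so inference can proceed left to right with key-value caching.
        T = noisy_tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(noisy_tokens.device)
        h = self.encoder(self.in_proj(noisy_tokens), mask=mask)
        return self.out_proj(h)

def training_step(model, clean_tokens, noise_std=0.5):
    # Assumed corruption scheme: additive Gaussian noise on the ground-truth
    # reference, strong enough that a near-identity copy of the input cannot
    # minimize the loss (the collapse mode the abstract warns about).
    noisy = clean_tokens + noise_std * torch.randn_like(clean_tokens)
    pred = model(noisy)                    # every position predicted in parallel
    return F.mse_loss(pred, clean_tokens)  # regress the clean continuous token

model = SelfTokenPredictor()
tokens = torch.randn(8, 16, 32)  # (batch, sequence length, continuous token dim)
loss = training_step(model, tokens)
loss.backward()
```

Under this reading, the corruption is what blocks the identity shortcut: copying the (noisy) input can no longer minimize the regression loss, while the causal mask keeps training consistent with token-by-token decoding.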
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2263