Keywords: unified multimodal model, decoder-only architecture, mixture-of-experts, autoregressive
Abstract: We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a single decoder-only transformer architecture. OneCAT uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizers during inference, leading to significant efficiency gains, especially for high-resolution image inputs and outputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) design trained with a unified autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) with a proposed scale-aware adapter (SAA), which drastically reduces decoding latency compared to diffusion-based methods while maintaining state-of-the-art performance. As a result, OneCAT outperforms existing unified models across benchmarks for multimodal understanding, generation, and editing, demonstrating the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence.
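The abstract's central design idea is hard, modality-specific expert routing inside a decoder-only transformer. Below is a minimal sketch of that idea, not the authors' implementation: the class name `ModalityMoEBlock`, the two-expert text/image split, and all dimensions are illustrative assumptions inferred only from the abstract.

```python
# Minimal sketch (assumption, not OneCAT's actual code) of a decoder block whose
# feed-forward path is routed by token modality rather than a learned gate.
import torch
import torch.nn as nn

class ModalityMoEBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality; tokens are dispatched by a
        # precomputed modality mask (0 = text, 1 = image).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality: (batch, seq) integer tensor in {0, 1}
        h = self.norm1(x)
        # Causal self-attention shared across modalities (unified AR objective).
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, expert in enumerate(self.experts):
            sel = modality == idx        # tokens owned by this modality's expert
            out[sel] = expert(h[sel])    # apply the expert only to those tokens
        return x + out

tokens = torch.randn(2, 16, 512)
modality = (torch.rand(2, 16) > 0.5).long()        # random text/image assignment
print(ModalityMoEBlock()(tokens, modality).shape)  # torch.Size([2, 16, 512])
```

Hard routing by modality avoids a learned gating network entirely, which is one way a unified AR model can keep a single decoder-only backbone while specializing capacity per modality.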
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3014