Keywords: Multimodal, Unified Multimodal Model, Generative Model
TL;DR: Exploring the synergy between visual generation and perception by formulating the unified multimodal model as an auto-encoder.
Abstract: The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing their mutual benefits. We argue that true unification requires more than merging the two tasks: it requires a unified, foundational objective that intrinsically links them. In this paper, we introduce a unifying paradigm viewed through the **Auto-Encoder lens**, *i.e.*, regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. We argue that *if the encoder truly "understands" the image, its description should capture all essential structure, and if the decoder truly "understands" the text, it should recover that structure faithfully.* Hence, high-fidelity reconstruction serves as a powerful criterion for genuine multimodal unification, evidencing near-lossless, bidirectional information flow between the two processes. To implement this, we propose **UAE**. We first pre-train the decoder on our proposed 700K long-context image-caption pairs, directing it to "understand" the fine-grained and complex semantics in text, since longer intermediate text in our Auto-Encoder framework preserves more information from the input image for reconstruction. We then propose **Unified-GRPO**, a reinforcement learning (RL) scheme that unifies the two through two complementary stages: (1) *Generation for Understanding*, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) *Understanding for Generation*, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can substantially enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception such as small-object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding benefit each other, moving closer to truly unified multimodal intelligence.
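As a rough illustration of the reconstruction-as-reward idea in stage (1), a minimal sketch is given below. It is not UAE's actual implementation; the names `i2t_encoder`, `t2i_decoder`, and `embed_image` are hypothetical placeholders, and the reward here is simply a cosine similarity between embeddings of the original and reconstructed images.

```python
# Hypothetical sketch of the reconstruction-as-reward idea (Generation for Understanding).
# i2t_encoder, t2i_decoder, and embed_image are illustrative placeholders, not UAE's API.
import torch
import torch.nn.functional as F

def reconstruction_reward(image, i2t_encoder, t2i_decoder, embed_image):
    """Score how well the encoder's caption preserves the image's content.

    The encoder compresses the image into a caption; a (frozen) decoder then
    reconstructs an image from that caption; the reward is the similarity
    between the original and the reconstruction.
    """
    caption = i2t_encoder.caption(image)            # I2T: compress image into text
    reconstruction = t2i_decoder.generate(caption)  # T2I: reconstruct image from text
    # Perceptual similarity via embeddings from a frozen vision encoder.
    z_orig = embed_image(image)
    z_rec = embed_image(reconstruction)
    return F.cosine_similarity(z_orig, z_rec, dim=-1).mean()
```

In an RL loop such as GRPO, this scalar could serve as the reward for the caption-generating policy, so that higher rewards correspond to captions from which the image is more faithfully recoverable.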
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 104