MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

28 Sept 2024 (modified: 14 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, Vision Language Model
Abstract: Recent advancements in multimodal large language models have propelled the development of joint probabilistic models for image understanding and generation. Existing methods that discretize the image space cause information loss and reduce model capacity. Recent work that integrates diffusion transformers with text autoregression shows promise, but it makes incomplete use of image information for understanding tasks: diffusion transformers encode image information across various noise levels, whereas image understanding tasks take only the clean image as input. In this paper, we develop a novel MultiModal AutoregRessive (MMAR) probabilistic modeling framework based on continuous image representations. Unlike previous methods, MMAR avoids the information loss associated with discretization and the drawbacks of combining diffusion transformers with AR models. It employs a standalone diffusion-based continuous probabilistic sampler at the image-token level on top of the LLM, which theoretically ensures lossless image-text joint probabilistic modeling. In practice, to address the substantial optimization difficulties encountered in the low-precision training regimes common for LLMs, we theoretically derive an optimal diffusion model parameterization that minimizes numerical error. To balance visual understanding and generation capabilities, we introduce a two-stage training strategy and an extremely large CFG scale for inference. MMAR demonstrates clear scaling behavior with more data and larger model sizes. Extensive evaluations on 18 image understanding benchmarks reveal that MMAR is the first joint image-text modeling framework to approach the performance of traditional MLLMs that employ a pretrained CLIP vision encoder, marking a significant step toward lossless joint probabilistic modeling of images and text.
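To make the core idea in the abstract concrete, the sketch below illustrates one plausible reading of a per-token continuous probabilistic sampler: a small diffusion head, conditioned on the LLM's hidden state, models each continuous image token instead of a discrete softmax. This is not the authors' code; the module structure, sizes, and the plain epsilon-prediction objective are illustrative assumptions (the paper instead derives an optimal parameterization to minimize numerical error under low-precision training, which is not reproduced here).

```python
# Minimal sketch (assumed, not the authors' implementation) of a per-token diffusion head
# conditioned on an LLM hidden state, modeling a continuous image token.
import torch
import torch.nn as nn


class DiffusionTokenHead(nn.Module):
    """Predicts the noise added to a continuous image token, given the noisy token,
    the diffusion timestep, and the LLM hidden state as conditioning."""

    def __init__(self, token_dim: int, cond_dim: int, hidden: int = 1024):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, noisy_token, t, llm_hidden):
        temb = self.time_embed(t.unsqueeze(-1))
        return self.net(torch.cat([noisy_token, llm_hidden, temb], dim=-1))


def diffusion_token_loss(head, clean_token, llm_hidden):
    """Standard epsilon-prediction loss for one continuous image token (illustrative;
    the paper's derived parameterization differs)."""
    b = clean_token.size(0)
    t = torch.rand(b, device=clean_token.device)        # uniform timestep in [0, 1]
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2         # simple cosine noise schedule
    eps = torch.randn_like(clean_token)
    noisy = (
        alpha_bar.sqrt().unsqueeze(-1) * clean_token
        + (1 - alpha_bar).sqrt().unsqueeze(-1) * eps
    )
    pred_eps = head(noisy, t, llm_hidden)
    return torch.mean((pred_eps - eps) ** 2)
```

At inference, a hypothetical image-generation step would run this head's reverse diffusion process once per image token, conditioned on the LLM hidden state at that position, with classifier-free guidance applied by contrasting conditional and unconditional hidden states.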
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13913