A Survey on Multimodal Large Language Models

28 Feb 2025 (modified: 01 Mar 2025) | XJTU 2025 CSUC Submission | CC BY 4.0
Keywords: Multimodal, Large Language Models, Vision-Language Models
TL;DR: This paper reviews the development of Multimodal Large Language Models (MLLMs), analyzes their frameworks and training strategies, and provides insights to advance research and applications in the field.
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have gradually become an important research direction in artificial intelligence. Traditional unimodal language models rely primarily on textual data; although they handle language tasks well, their performance is limited when dealing with non-text data such as images and audio. MLLMs integrate multiple forms of data, including text, images, audio, and video, and thereby significantly improve performance on multimodal tasks such as visual-language understanding, cross-modal reasoning, and vision-based generation. These models provide a more comprehensive ability to understand and reason over information, driving diverse applications of intelligent systems. This paper first reviews the basic architectures used in current MLLM research, introducing in detail the model training strategies (pre-training, instruction tuning, and alignment tuning) and data processing methods, and also surveys common evaluation criteria for multimodal tasks. Next, the paper discusses directions for extending MLLMs, including how to optimize models for more complex tasks such as multimodal reasoning, unsupervised learning, and cross-modal reasoning. We then analyze key challenges in current MLLM research, focusing on modality fusion and techniques for mitigating multimodal hallucination. Finally, the paper looks ahead to future research directions for MLLMs and proposes potential technological breakthroughs.
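For readers unfamiliar with the framework the abstract refers to, below is a minimal sketch of the modality-encoder / connector / LLM design that many MLLMs share: a vision encoder produces visual features, a projector maps them into the language model's embedding space, and the projected visual tokens are consumed jointly with text tokens. All module names, dimensions, and the toy Transformer backbone are illustrative assumptions, not taken from the paper or from any specific model.

```python
# Illustrative sketch of a generic MLLM pipeline (assumed structure, not a
# specific published model): vision encoder -> projector -> language backbone.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Modality encoder: maps image patch features to visual features.
        # In practice this is a pretrained ViT/CLIP encoder, often frozen.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Connector/projector: aligns visual features with the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Language backbone: stands in for a pretrained decoder-only LLM.
        # A tiny Transformer encoder is used here purely as a placeholder.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_tokens):
        # image_patches: (B, N_patches, vision_dim); text_tokens: (B, N_tokens)
        visual = self.projector(self.vision_encoder(image_patches))
        textual = self.text_embed(text_tokens)
        # Prepend projected visual tokens to the text sequence and decode jointly.
        sequence = torch.cat([visual, textual], dim=1)
        hidden = self.backbone(sequence)
        return self.lm_head(hidden)

if __name__ == "__main__":
    model = ToyMLLM()
    img = torch.randn(2, 16, 768)           # 2 images, 16 patches each
    txt = torch.randint(0, 32000, (2, 8))   # 2 captions, 8 tokens each
    logits = model(img, txt)
    print(logits.shape)                     # torch.Size([2, 24, 32000])
```

In the training recipe the abstract outlines, the projector is typically learned during pre-training on image-text pairs, while instruction tuning and alignment tuning further adapt the language backbone to follow multimodal instructions.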
Submission Number: 21