A Survey on Visual Understanding Multimodal Large Language Models

25 Feb 2025 (modified: 01 Mar 2025) · XJTU 2025 CSUC Submission · CC BY 4.0
Keywords: Multimodal Models, Visual Understanding, Large Language Models, Image Understanding, Short-video Understanding, Long-video Understanding
TL;DR: This survey elaborates on the characteristics and model designs of three visual understanding tasks: image understanding, short video understanding, and long video understanding.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on natural language processing tasks by scaling up the number of model parameters and the volume of training data. To extend this capability to visual understanding tasks, multimodal large language models (MM-LLMs) have been developed by integrating LLMs with visual encoders. These models can handle tasks such as image captioning, detailed image description, and image question answering, as well as more complex tasks like video understanding. This survey first outlines the characteristics and challenges of three visual understanding tasks: image understanding, short-video understanding, and long-video understanding. It then provides a detailed introduction to the model architectures used for these tasks, highlighting their similarities and differences, and discusses evolving trends in model training methods. Finally, the survey presents performance evaluations of several representative models and offers insights into future directions for visual understanding MM-LLMs.
Submission Number: 4
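As a concrete illustration of the integration the abstract describes (a visual encoder whose features are projected into the embedding space of an LLM), the following is a minimal PyTorch sketch. All module choices, dimensions, and the single-token projection below are illustrative assumptions for readability; they do not correspond to any specific model covered by the survey.

```python
import torch
import torch.nn as nn

class MMLLMSketch(nn.Module):
    """Toy illustration of the common MM-LLM pattern: a visual encoder,
    a projector into the LLM embedding space, and an LLM backbone.
    Every component here is a small stand-in, not a real pretrained model."""

    def __init__(self, vis_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT), usually kept frozen.
        self.visual_encoder = nn.Linear(3 * 224 * 224, vis_dim)
        # Projector mapping visual features into the LLM token space;
        # real systems use a linear layer, an MLP, or a Q-Former-style module here.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the LLM backbone and its output head.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, input_ids):
        # Encode the image and project it to a single "visual token".
        vis_feat = self.visual_encoder(image.flatten(1))       # (B, vis_dim)
        vis_token = self.projector(vis_feat).unsqueeze(1)      # (B, 1, llm_dim)
        # Prepend the visual token to the text token embeddings so the
        # language model attends over both modalities jointly.
        txt_tokens = self.text_embed(input_ids)                # (B, T, llm_dim)
        sequence = torch.cat([vis_token, txt_tokens], dim=1)   # (B, 1+T, llm_dim)
        hidden = self.llm(sequence)
        return self.lm_head(hidden)                            # next-token logits

# Toy usage: one 224x224 RGB image and a short prompt of 8 token ids.
model = MMLLMSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 9, 1000])
```

In practice the visual encoder is a pretrained model (for example a CLIP-style ViT) producing many patch tokens rather than one, and the projector is trained to align those features with a frozen or fine-tuned LLM; the survey's architecture sections discuss how such connector designs differ across image and video models.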
