Keywords: Multimodal Models, Visual Understanding, Large Language Models, Image Understanding, Short-video Understanding, Long-video Understanding
TL;DR: This survey elaborates on the characteristics and model designs of three visual understanding tasks: image understanding, short video understanding, and long video understanding.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on natural language processing tasks by scaling up model parameters and training data. To extend this capability to visual understanding tasks, multimodal large language models (MM-LLMs) have been developed by integrating LLMs with visual encoders. These models can handle tasks such as image captioning, detailed image description, and image question answering, as well as more complex tasks like video understanding. This survey first outlines the characteristics and challenges of three visual understanding tasks: image understanding, short video understanding, and long video understanding. It then provides a detailed introduction to the model architectures used in these tasks, highlighting their similarities and differences, and discusses evolving trends in model training methods. Additionally, the paper presents performance evaluations of several representative models and offers insights into future directions for visual understanding MM-LLMs.
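To make the encoder-projector-LLM pattern mentioned in the abstract concrete, below is a minimal sketch of how an MM-LLM composes a visual encoder with an LLM backbone. All module names, dimensions, and the use of PyTorch here are illustrative assumptions for exposition, not details taken from the survey or from any specific model it covers.

```python
# Minimal sketch (assumed, not from the survey): a visual encoder produces
# patch features, a projector maps them into the LLM's embedding space, and
# the LLM attends over visual and text tokens jointly.
import torch
import torch.nn as nn


class TinyMMLLM(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=1024, vocab=32000):
        super().__init__()
        # Stand-in for a pretrained visual encoder (e.g. a ViT).
        self.visual_encoder = nn.Linear(vis_dim, vis_dim)
        for p in self.visual_encoder.parameters():
            # Encoders are often kept frozen during alignment training.
            p.requires_grad_(False)
        # Projector bridging vision features into the LLM token space.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the LLM backbone: embedding + Transformer + head.
        self.token_emb = nn.Embedding(vocab, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_feats, text_ids):
        # patch_feats: (B, num_patches, vis_dim) visual features
        # text_ids:    (B, seq_len) token ids of the text prompt
        vis_tokens = self.projector(self.visual_encoder(patch_feats))
        txt_tokens = self.token_emb(text_ids)
        # Prepend projected visual tokens so the LLM sees both modalities.
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(seq))


model = TinyMMLLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # (1, 24, 32000): next-token logits per position
```

The same pattern extends to video understanding by encoding sampled frames into visual tokens, which is one reason long-video models must manage token counts carefully.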
Submission Number: 4