Keywords: multi-modal language models, cross-attention, moe
TL;DR: an efficient vision-language model for visual understanding.
Abstract: In the field of multi-modal language models, most methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals, such as videos, the self-attention mechanism of language models incurs significant computational overhead. Additionally, single-layer ViT features make it difficult for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well on tasks such as image captioning and video captioning.
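The three components named in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the shapes, the top-1 routing rule, and all weight matrices are illustrative assumptions. It shows text tokens cross-attending (Flamingo-style) to visual tokens concatenated from several ViT layers, followed by a small MoE feed-forward block:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, vision, Wq, Wk, Wv):
    # Text tokens are queries; visual tokens supply keys/values.
    # Cost is O(T * V) rather than O((T + V)^2) for full self-attention.
    q, k, v = text @ Wq, vision @ Wk, vision @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def moe_ffn(x, experts, Wg):
    # Top-1 routing: each token is processed by its highest-scoring expert,
    # scaled by the gate probability (a common, simplified MoE variant).
    gate = softmax(x @ Wg)            # (T, n_experts)
    choice = gate.argmax(-1)
    out = np.zeros_like(x)
    for i, (W1, W2) in enumerate(experts):
        mask = choice == i
        if mask.any():
            h = np.maximum(x[mask] @ W1, 0.0)   # ReLU feed-forward expert
            out[mask] = gate[mask, i:i + 1] * (h @ W2)
    return out

d, T, V, n_exp = 16, 5, 7, 4              # toy sizes, assumptions
text = rng.standard_normal((T, d))
# Hierarchical ViT features: concatenate tokens drawn from several layers.
vision_layers = [rng.standard_normal((V, d)) for _ in range(3)]
vision = np.concatenate(vision_layers, axis=0)   # (3V, d)

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
experts = [(rng.standard_normal((d, 4 * d)),
            rng.standard_normal((4 * d, d))) for _ in range(n_exp)]
Wg = rng.standard_normal((d, n_exp))

attended = cross_attention(text, vision, Wq, Wk, Wv)
out = text + moe_ffn(attended, experts, Wg)      # residual connection
print(out.shape)   # (5, 16)
```

Keeping visual tokens out of the language model's self-attention, as in the sketch, is what caps the quadratic cost when the visual sequence (e.g. a video) is long.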
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3574