Efficient Multi-modal Large Language Models via Visual Token Grouping

13 Sept 2024 (modified: 15 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Large Language Model, Multi-modal Learning
Abstract: The development of Multi-modal Large Language Models (MLLMs) has significantly advanced various downstream applications, including visual question answering and image captioning. However, the substantial computational cost of processing high-resolution images and videos is a barrier to their broader adoption. Compressing vision tokens in MLLMs has emerged as a promising way to reduce inference costs. In this paper, we introduce \methodname, a novel grouping mechanism that leverages the capabilities of pretrained vision encoders to group similar image segments without the need for segmentation masks. With the isolated attention we adopt, \methodname can identify and eliminate redundant visual tokens, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of \methodname, which maintains over 98.1% of the original performance while achieving a reduction of over 27% in TFLOPS.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 277