Keywords: high-resolution adaptation, multimodal large language models
Abstract: In existing multimodal large language models (MLLMs), image resolution plays a significant role in fine-grained visual recognition. However, directly increasing image resolution incurs expensive computational costs for MLLMs. In this paper, we reveal that a combination of low- and high-resolution visual features can efficiently mitigate this shortcoming. Based on this principle, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images of different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 17 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 15 VL tasks, e.g., +5.2% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and faster inference speed than LLaVA-NeXT. Source code is released at: https://github.com/luogen1996/LLaVA-HR.
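To make the dual-pathway idea concrete, below is a minimal PyTorch sketch of how high-resolution features could be embedded into a low-resolution pathway while keeping the short low-resolution token sequence. This is an illustration under stated assumptions, not the authors' implementation: the module name MRAdapterSketch, the feature dimensions, and the tanh-gated fusion are all hypothetical choices for clarity.

    # Illustrative sketch only (NOT the authors' MR-Adapter); names,
    # dimensions, and the gating design are assumptions.
    import torch
    import torch.nn as nn

    class MRAdapterSketch(nn.Module):
        """Embeds high-resolution features into the low-resolution pathway."""
        def __init__(self, low_dim: int = 1024, high_dim: int = 512):
            super().__init__()
            # Project high-res (e.g., convolutional) features to the
            # low-res (e.g., ViT) feature dimension.
            self.proj = nn.Conv2d(high_dim, low_dim, kernel_size=1)
            # Learnable gate; zero-initialized so fusion starts as identity.
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
            # low_feat:  (B, C_low,  H,  W)   from the low-resolution pathway
            # high_feat: (B, C_high, H', W')  from the high-resolution pathway
            # Pool high-res features down to the low-res spatial grid, so the
            # fused output keeps the short low-res token sequence length.
            high = nn.functional.adaptive_avg_pool2d(high_feat, low_feat.shape[-2:])
            return low_feat + torch.tanh(self.gate) * self.proj(high)

    # Usage: fuse high-res-pathway conv features into low-res-pathway ViT features.
    low = torch.randn(1, 1024, 32, 32)    # low-res pathway feature map
    high = torch.randn(1, 512, 128, 128)  # high-res pathway feature map
    fused = MRAdapterSketch()(low, high)  # -> (1, 1024, 32, 32): sequence length unchanged

One point the sketch highlights: because fusion happens on the low-resolution spatial grid, the number of visual tokens passed to the language model does not grow with input resolution, which is the source of the efficiency claim in the abstract.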
Supplementary Material:  pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2025