Keywords: Multimodal Large Language Model
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks.
The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling a \emph{static} LLM to comprehend visual information through visual instruction tuning.
Unfortunately, the \emph{static} paradigm shares the same parameters across multi-task instruction tuning, inevitably introducing potential \emph{task interference} or \emph{negative transfer}, \emph{i.e.}, an improvement in the performance of one task reduces the performance of other tasks.
In light of this, we introduce \textbf{HyperLLaVA}, which, in conjunction with a dynamic visual expert and a dynamic language expert, adjusts the parameters of the projector and the LLM layers, respectively, conditioned on diverse instruction semantics, thereby minimizing task interference.
These experts are derived from HyperNetworks, which adaptively generate dynamic parameter shifts under visual and language guidance, enabling dynamic vision-language alignment and instruction tuning in a two-stage training process.
To study the multi-task interference of MLLMs in depth, we build the \textbf{Comprehensive Multimodal Task benchmark} (\texttt{CMT}), a benchmark for evaluating multidimensional multimodal tasks.
Experiments demonstrate the superiority of the dynamic tuning paradigm for multi-task instruction following on \texttt{CMT} and on general MLLM benchmarks. Our project is available at \href{https://anonymous.4open.science/r/HyperLLaVA-D58E}{https://anonymous.4open.science/r/HyperLLaVA-D58E}.
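The dynamic-expert mechanism described in the abstract can be illustrated with a minimal sketch: a small hypernetwork maps a guidance embedding (e.g., instruction or visual semantics) to a low-rank parameter shift that is added to a static projector weight. This is not the authors' implementation; all dimensions and names below are hypothetical, chosen only to show the input-conditioned parameter-shift idea.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT, D_CTX, RANK = 16, 16, 8, 4  # hypothetical dimensions

# Static projector weight, as in the standard LLaVA-style mapper.
W_static = rng.standard_normal((D_OUT, D_IN)) * 0.02

# HyperNetwork: maps a guidance embedding to low-rank factors A, B
# whose product forms the dynamic parameter shift for the projector.
H_A = rng.standard_normal((D_OUT * RANK, D_CTX)) * 0.02
H_B = rng.standard_normal((RANK * D_IN, D_CTX)) * 0.02

def dynamic_projector(x, ctx):
    """Project `x` with a weight shifted by a hypernetwork output on `ctx`."""
    A = (H_A @ ctx).reshape(D_OUT, RANK)
    B = (H_B @ ctx).reshape(RANK, D_IN)
    delta_W = A @ B                      # guidance-conditioned shift
    return (W_static + delta_W) @ x

x = rng.standard_normal(D_IN)
ctx1 = rng.standard_normal(D_CTX)      # guidance from one instruction
ctx2 = rng.standard_normal(D_CTX)      # guidance from another instruction
y1 = dynamic_projector(x, ctx1)
y2 = dynamic_projector(x, ctx2)
# Different guidance yields different projections of the same input,
# which is how a dynamic expert avoids one-size-fits-all parameters.
assert not np.allclose(y1, y2)
```

The low-rank factorization keeps the hypernetwork small: it emits `(D_OUT + D_IN) * RANK` values instead of a full `D_OUT * D_IN` weight matrix per input.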
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5708