Keywords: Multimodal Large Language Model
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks.
The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling a \emph{static} LLM to comprehend visual information through visual instruction tuning.
Unfortunately, the \emph{static} paradigm shares the same parameters across multi-task instruction tuning, inevitably introducing potential \emph{task interference} or \emph{negative transfer}, \emph{i.e.}, an improvement in the performance of one task reduces the performance of other tasks.
In light of this, we introduce \textbf{HyperLLaVA}, which, in conjunction with a dynamic visual expert and a dynamic language expert, adjusts the parameters of the projector and the LLM layers, respectively, conditioned on diverse instruction semantics, thereby minimizing task interference.
These experts are derived from HyperNetworks, which adaptively generate dynamic parameter shifts under visual and language guidance, enabling dynamic vision-language alignment and instruction tuning in a two-stage training process.
To study the multi-task interference of MLLMs in depth, we build the \textbf{Comprehensive Multimodal Task benchmark} (\texttt{CMT}), a benchmark for evaluating multidimensional multimodal tasks.
Experiments demonstrate the superiority of the dynamic tuning paradigm for multi-task instruction following on \texttt{CMT} and on general MLLM benchmarks. Our project is available at \href{https://anonymous.4open.science/r/HyperLLaVA-D58E}{https://anonymous.4open.science/r/HyperLLaVA-D58E}.
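The dynamic-expert mechanism described in the abstract can be illustrated with a minimal sketch: a small hypernetwork maps a guidance embedding (e.g., instruction or visual semantics) to a low-rank parameter shift that is added to a static projector weight. This is not the authors' implementation; all dimensions and names below are hypothetical, chosen only to show the input-conditioned parameter-shift idea.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT, D_CTX, RANK = 16, 16, 8, 4  # hypothetical dimensions

# Static projector weight, as in the standard LLaVA-style mapper.
W_static = rng.standard_normal((D_OUT, D_IN)) * 0.02

# HyperNetwork: maps a guidance embedding to low-rank factors A, B
# whose product forms the dynamic parameter shift for the projector.
H_A = rng.standard_normal((D_OUT * RANK, D_CTX)) * 0.02
H_B = rng.standard_normal((RANK * D_IN, D_CTX)) * 0.02

def dynamic_projector(x, ctx):
    """Project `x` with a weight shifted by a hypernetwork output on `ctx`."""
    A = (H_A @ ctx).reshape(D_OUT, RANK)
    B = (H_B @ ctx).reshape(RANK, D_IN)
    delta_W = A @ B                      # guidance-conditioned shift
    return (W_static + delta_W) @ x

x = rng.standard_normal(D_IN)
ctx1 = rng.standard_normal(D_CTX)      # guidance from one instruction
ctx2 = rng.standard_normal(D_CTX)      # guidance from another instruction
y1 = dynamic_projector(x, ctx1)
y2 = dynamic_projector(x, ctx2)
# Different guidance yields different projections of the same input,
# which is how a dynamic expert avoids one-size-fits-all parameters.
assert not np.allclose(y1, y2)
```

The low-rank factorization keeps the hypernetwork small: it emits `(D_OUT + D_IN) * RANK` values instead of a full `D_OUT * D_IN` weight matrix per input.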
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5708