Multi-modality Expansion and Retention for LLMs through Parameter Merging and Decoupling

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multimodal Large Language Models, Model Merging, Parameter Decoupling, Knowledge Localization
TL;DR: We propose MMER, a training-free approach that leverages parameter merging and decoupling to reuse and compose existing MLLMs, enabling multimodal expansion for LLMs while retaining the performance of original MLLMs.
Abstract: Extensively fine-tuning the combination of multimodal encoders and Large Language Models (LLMs) on modality-specific data can expand the modalities an LLM can handle, yielding Multimodal Large Language Models (MLLMs). However, this paradigm for expanding modalities relies heavily on fine-tuning from scratch with new multimodal data, which is both resource-intensive and inflexible. In this paper, we propose $\textit{MMER (Multi-modality Expansion and Retention)}$, a novel $\textit{training-free}$ approach that reuses and composes existing MLLMs to enable effective multimodal expansion while retaining the original performance of each MLLM. Specifically, MMER retains the multimodal encoders of the MLLMs while merging their LLM parameters. By comparing the original LLM parameters with the merged ones, MMER creates binary masks that approximately separate the LLM parameters associated with each modality. The decoupled parameters can then independently process modality-specific inputs, reducing parameter conflicts and preserving the fidelity of the original MLLMs. Additionally, MMER mitigates catastrophic forgetting by applying the same strategy to decouple parameters fine-tuned on new tasks from the original parameters. Experiments on three multimodal tasks and fourteen dual-modal tasks show significant improvements over recent baselines, demonstrating that MMER can effectively expand the multimodal capabilities of LLMs while retaining 99.6\% of the original performance. Further experiments in both single-task and cross-modality multi-task scenarios show that MMER substantially mitigates catastrophic forgetting.
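To make the mechanism sketched in the abstract concrete, here is a minimal PyTorch illustration of merging per-modality LLM parameters and deriving binary masks by comparing the merged weights against the originals. The task-arithmetic merge rule, the sign-agreement mask criterion, and all names (`merge_and_decouple`, `lam`) are assumptions for illustration only; the paper's exact procedure may differ.

```python
import torch

def merge_and_decouple(base, finetuned_list, lam=1.0):
    """Sketch of parameter merging and mask-based decoupling.

    base:           state dict of the original LLM.
    finetuned_list: state dicts of the LLM weights inside each
                    modality-specific MLLM (same keys as `base`).
    """
    # Task vectors: each fine-tuned model's offset from the base LLM.
    task_vectors = [
        {k: ft[k] - base[k] for k in base} for ft in finetuned_list
    ]

    # Merge by adding the summed task vectors onto the base
    # (a simple task-arithmetic merge, assumed here for illustration).
    merged = {
        k: base[k] + lam * sum(tv[k] for tv in task_vectors) for k in base
    }

    # Binary masks: for each modality, flag parameters where the merged
    # update agrees in sign with that modality's own update, approximately
    # localizing the parameters relevant to that modality.
    masks = [
        {k: (torch.sign(merged[k] - base[k]) == torch.sign(tv[k]))
            & (tv[k] != 0)
         for k in base}
        for tv in task_vectors
    ]

    # Decoupled parameters for modality i: merged values where the mask
    # is set, the original base LLM values elsewhere.
    decoupled = [
        {k: torch.where(m[k], merged[k], base[k]) for k in base}
        for m in masks
    ]
    return merged, masks, decoupled
```

At inference time, one would route a modality-specific input through the corresponding decoupled parameter set (with that modality's encoder kept intact), which is how the abstract describes conflicts between modalities being reduced.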
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4601