TL;DR: We explore whether model merging can improve the reasoning ability of VLMs, and we further adopt model merging as an interpretability tool to analyze the inner workings of VLMs.
Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities of Large Language Models (LLMs), such as reasoning. However, the mechanisms by which these two abilities are combined, and how each contributes, remain poorly understood.
In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models.
Unlike previous works that often focus on merging models of the same kind, we propose merging models **across modalities**, enabling the incorporation of the reasoning capabilities of LLMs into VLMs.
Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a **training-free** manner.
Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
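To make the idea of merging parameters across models concrete, the following is a minimal sketch, not the authors' implementation. It assumes the VLM's language backbone and the reasoning LLM share parameter names and shapes, and it uses plain weighted averaging as the merging rule. The function `merge_shared_params`, the mixing weight `alpha`, and the parameter names are all hypothetical and chosen only for illustration; the merging rule actually used in the paper may differ.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_shared_params(vlm: Params, llm: Params, alpha: float = 0.5) -> Params:
    """Interpolate parameters found in both models; keep the rest from the VLM.

    alpha is the weight given to the VLM (hypothetical hyperparameter);
    (1 - alpha) is given to the reasoning LLM.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    merged: Params = {}
    for name, vlm_values in vlm.items():
        llm_values = llm.get(name)
        if llm_values is None:
            # No LLM counterpart (e.g. vision-side weights): copy unchanged.
            merged[name] = list(vlm_values)
            continue
        if len(llm_values) != len(vlm_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted average of the two models' weights.
        merged[name] = [
            alpha * v + (1.0 - alpha) * r
            for v, r in zip(vlm_values, llm_values)
        ]
    return merged


# Hypothetical parameter names and values, for illustration only.
vlm_params = {"vision.proj": [1.0, 2.0], "lm.layer0": [0.0, 4.0]}
llm_params = {"lm.layer0": [2.0, 0.0]}
merged = merge_shared_params(vlm_params, llm_params, alpha=0.5)
```

In this sketch, no gradient step is taken, which is what makes the procedure training-free: the merged weights are computed directly from the two existing checkpoints, and parameters without a counterpart in the LLM are left untouched.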
Lay Summary: Modern AI systems that understand images and text tend to struggle with complex reasoning tasks, such as interpreting charts or solving math problems from pictures, because they lack strong reasoning skills when images are involved.
We introduce a simple method that merges a reasoning-focused text AI into an image-and-text AI system, effectively transferring reasoning knowledge without any extra training. The merged AI solves visual reasoning puzzles much more accurately while retaining its ability to recognize and describe images.
By studying how different parts of the merged system change, we find that image-recognition skills reside mainly in the early layers and reasoning skills mainly in the middle-to-late layers, and that our method spreads reasoning ability across the entire network.
Our work provides a straightforward way to build smarter multimodal AI and reveals how perception and reasoning interact inside these models.
Link To Code: https://github.com/shiqichen17/VLM_Merging
Primary Area: Deep Learning->Large Language Models
Keywords: Interpretability, Model Merging, Vision Language Models
Submission Number: 5671