Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models
Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information expressed in human languages, but their multilingual capabilities remain imbalanced.
In this work, we investigate the multilingual working patterns of LVLMs and identify a salient correlation between their multilingual understanding ability and language-specific neuron activations in shallow layers.
Building on this insight, we introduce \ours, a training recipe that achieves efficient multilingual enhancement for LVLMs by \textbf{P}recise \textbf{LA}nguage-\textbf{S}pecific layers fine-\textbf{T}uning.
\ours first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MMBench and MMMB demonstrate that \ours effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14\% of the parameters tuned.
Further analysis reveals that \ours generalizes to low-resource languages and complex visual reasoning tasks, facilitating language-specific engagement with visual information in shallow layers\footnote{The project will be available at: \url{https://github.com/fmm170/PLAST}}.
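To make the layer-selection idea concrete, the sketch below compares per-layer activation profiles for the same question in two languages and unfreezes only the most divergent layers. This is a minimal illustration, not the paper's method: \texttt{gpt2} stands in for an LVLM's language backbone, and the mean-absolute-activation divergence and top-$k$ selection are simplified, hypothetical proxies for monitoring language-specific neuron activations.

\begin{verbatim}
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: find layers whose activations diverge most across languages,
# then fine-tune only those layers (all metrics here are simplified proxies).
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

@torch.no_grad()
def layer_profile(text: str) -> torch.Tensor:
    """Mean absolute hidden-state activation per transformer layer."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # out.hidden_states: embeddings + one tensor per layer, each [1, T, H]
    return torch.stack([h.abs().mean() for h in out.hidden_states[1:]])

# Activation profiles for the same question in two languages.
en = layer_profile("What is shown in the image?")
de = layer_profile("Was ist auf dem Bild zu sehen?")

# Pick the k layers whose activations diverge most across languages;
# the paper reports such layers concentrate in the shallow part of the model.
k = 3
selected = (en - de).abs().topk(k).indices.tolist()

# Freeze everything, then unfreeze only the selected layers for tuning.
for p in model.parameters():
    p.requires_grad = False
for i in selected:
    for p in model.transformer.h[i].parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Fine-tuning layers {selected} ({trainable} trainable params)")
\end{verbatim}

In an actual PLAST-style setup, the selected layers would then be trained on question-translation pairs; the freezing pattern above is what yields the small trainable-parameter fraction reported in the abstract.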