Why do LLaVA Vision-Language Models Reply to Images in English?

Published: 06 May 2025 | Last Modified: 29 May 2025 | VLMs4All 2025 Poster | License: CC BY-NC-ND 4.0
Keywords: multilingual models, multimodal models, model design, interpretability
TL;DR: We explore why LLaVA models reply to non-English queries in English when an image is provided.
Abstract: We identify a novel pathology of multilingual vision-language models (VLMs): adding an image to the input reduces the likelihood that the model will reply in the same language as the query. We term this pathology \textit{Image-induced Fidelity Loss} (IFL), and study its prevalence, causes, and remedies in LLaVA-style VLMs. On prevalence, we show that IFL occurs in four different LLaVA-style VLMs across three model sizes and fourteen languages. Systematic experimental ablation of the LLaVA design space shows that, among training data language, vision backbone, and language backbone, the choice of language backbone has the largest impact on IFL. This finding is supported by examination of the input embeddings at the point of multimodal fusion, where visual inputs are encoded separately from textual ones, regardless of language. Finally, we show that a lightweight intervention technique from the mechanistic interpretability literature can reduce IFL. Taken together, we formalize a novel challenge arising in multilingual multimodal settings and comprehensively analyze its prevalence and causes in a popular class of VLMs.
Submission Number: 10
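The abstract does not spell out the intervention it refers to. As a rough illustration of the kind of lightweight mechanistic-interpretability technique it alludes to, the sketch below applies activation steering to the language backbone of a LLaVA-style model: a "language direction" is estimated from paired prompts in the target language and English, then added to a mid-layer's residual stream during generation to nudge replies toward the query language. The checkpoint name, layer index, steering strength, and the direction-estimation recipe are all assumptions chosen for illustration, not the authors' method.

```python
"""Hypothetical activation-steering sketch (assumptions, not the paper's exact method).

Estimate a "language direction" as the difference of mean hidden states between
target-language and English prompts, then add a scaled copy of it to one decoder
layer's output during generation to bias the model toward the query language.
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "lmsys/vicuna-7b-v1.5"  # assumption: a typical LLaVA-style language backbone
LAYER = 16                       # assumption: a mid-depth decoder layer
ALPHA = 4.0                      # assumption: steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def mean_hidden(prompts, layer):
    """Average the hidden state at `layer` over a small set of prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)


# Tiny paired corpora (same content in German and English) to estimate the direction.
de_prompts = ["Beschreibe das Wetter heute.", "Erkläre, wie Photosynthese funktioniert."]
en_prompts = ["Describe the weather today.", "Explain how photosynthesis works."]
direction = mean_hidden(de_prompts, LAYER) - mean_hidden(en_prompts, LAYER)
direction = direction / direction.norm()


def steer_hook(module, inputs, output):
    """Add the scaled language direction to this layer's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.device, hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden


handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
try:
    query = tok("Was ist auf diesem Bild zu sehen?", return_tensors="pt").to(model.device)
    gen = model.generate(**query, max_new_tokens=64)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```

In a full LLaVA-style setup, the same hook would be registered on the language backbone after the projected image tokens are fused into the prompt; the text-only version above is kept minimal to show only the steering mechanism.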