Assessing and Improving the Multilingual Visual Word Sense Disambiguation Ability of Vision-Language Models

Published: 21 Oct 2025, Last Modified: 21 Jan 2026 · Crossref · CC BY-SA 4.0
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding. Owing to their large-scale training, they excel at tasks such as visual question answering and image retrieval, and their strong generalization ability enables them to address novel and complex challenges. In this study, we evaluate the capability of VLMs on the Visual Word Sense Disambiguation (VWSD) task. Specifically, we examine their ability to select the correct image from a set of candidates for a given lemma based on minimal contextual information (a few additional words). Additionally, we evaluate the ability of VLMs to solve this task across multiple languages and analyze the performance of both multimodal encoder-based and generative VLMs.
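The candidate-selection setup described in the abstract can be sketched as a simple similarity ranking: embed the lemma-plus-context phrase and each candidate image with an encoder-based VLM, then pick the candidate with the highest cosine similarity. The sketch below is a minimal illustration under that assumption; the `disambiguate` function and the toy embeddings are ours, not from the paper (in practice the vectors would come from a model such as CLIP).

```python
import numpy as np

def disambiguate(text_emb, image_embs):
    """Return the index of the candidate image whose embedding has the
    highest cosine similarity to the text embedding, plus all scores."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t  # cosine similarities, one per candidate image
    return int(np.argmax(scores)), scores

# Toy 3-dimensional embeddings standing in for real VLM encoder outputs.
text = np.array([1.0, 0.0, 0.0])          # "lemma + few context words"
candidates = np.array([[0.9, 0.1, 0.0],   # correct sense
                       [0.0, 1.0, 0.0],   # distractor
                       [0.5, 0.5, 0.5]])  # distractor
best, scores = disambiguate(text, candidates)
# best → 0: the first candidate is ranked closest to the text
```

This is the standard retrieval-style formulation for encoder-based VLMs; generative VLMs would instead be prompted to choose among the candidates directly.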