Leveraging ensemble deep models and LLM for visual polysemy and word sense disambiguation

Published: 2025, Last Modified: 09 Jan 2026 · Multim. Tools Appl. 2025 · CC BY-SA 4.0
Abstract: Visual Polysemy Disambiguation (VPD) and Visual Word Sense Disambiguation (VWSD) are challenging tasks for both computer vision and NLP, since an image can have diverse contextual interpretations, ranging from concrete visual representations to abstract concepts. In this paper, we propose a novel approach that leverages ensemble deep models from computer vision to address VPD and the strength of LLMs to mitigate VWSD. We first generate visually representative images from textual descriptions through a zero-shot text-to-image generation framework based on image scraping and Google search. We then employ an ensemble of classifiers and a deep network to learn feature representations, classify images into contexts, and find the best match. Conversely, we generate textual descriptions from images using a Vision Transformer model and compute the cosine distance to the actual text. Experimental evaluation on benchmark datasets demonstrates that the combined approach strengthens both text-to-image and image-to-text generation and improves disambiguation accuracy, providing a robust solution for VWSD with an MRR of \(95.77\%\) and a Hit rate of \(92.00\%\), surpassing state-of-the-art methods.
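To make the image-to-text branch concrete, the sketch below illustrates the general idea of captioning an image with a ViT-based model and ranking candidate sense descriptions by cosine similarity to the caption. It is a minimal illustration, assuming the Hugging Face `transformers` pipeline and `sentence-transformers`; the specific model names, the `rank_senses` helper, and the example inputs are placeholders and do not reproduce the paper's ensemble classifiers or exact ViT configuration.

```python
# Illustrative sketch of the image-to-text branch: caption an image with a
# ViT-based captioning model, then rank candidate sense glosses by cosine
# similarity to the generated caption. Model names are placeholders, not
# necessarily the models used in the paper.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank_senses(image_path, candidate_glosses):
    """Return candidate glosses sorted by similarity to the image caption."""
    caption = captioner(image_path)[0]["generated_text"]
    cap_emb = encoder.encode(caption, convert_to_tensor=True)
    gloss_embs = encoder.encode(candidate_glosses, convert_to_tensor=True)
    scores = util.cos_sim(cap_emb, gloss_embs)[0]  # one cosine score per gloss
    return sorted(zip(candidate_glosses, scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)

# Hypothetical usage: disambiguating the polysemous word "bank" for an image.
# print(rank_senses("river_scene.jpg",
#                   ["a financial institution", "the sloping land beside a river"]))
```

The text-to-image direction described in the abstract would work analogously in reverse: candidate images retrieved or generated for each sense are scored against the textual context, and the best-matching context is selected.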