Abstract: Visual Word Sense Disambiguation (VWSD) task aims to find the most related image among 10 images to an ambiguous word in some limited textual context. In this work, we use AltCLIP features and a 3-layer standard transformer encoder to compare the cosine similarity between the given phrase and different images. Also, we improve our model’s generalization by using a subset of LAION-5B. The best official baseline achieves 37.20% and 54.39% macro-averaged hit rate and MRR (Mean Reciprocal Rank) respectively. Our best configuration reaches 39.61% and 56.78% macro-averaged hit rate and MRR respectively. The code will be made publicly available on GitHub.
Loading