A Computational Approach to Visual Metonymy
Keywords: metonymy, semiotic theory, dataset, cognitive reasoning
TL;DR: We introduce ViMET, the first visual metonymy benchmark dataset, and show that current vision-language models struggle to interpret images that evoke concepts through indirect, associative cues rather than literal depiction.
Abstract: Images often communicate more than they literally depict: a set of tools can suggest an occupation, and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset, comprising 2,000 multiple-choice questions that evaluate the cognitive reasoning abilities of multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our paper was accepted to the EACL 2026 main conference (oral), and the dataset will be made publicly available upon release of the paper.
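As a rough illustration (not the authors' released code), the sketch below shows how accuracy on a multiple-choice visual-metonymy benchmark such as ViMET might be computed; the item schema and the `query_vlm` callable are hypothetical placeholders standing in for the dataset format and a real vision-language model call.

```python
# Minimal sketch, assuming a hypothetical item schema:
# {"image": path, "question": str, "choices": [str, ...], "answer": str}
from typing import Callable


def evaluate(items: list[dict], query_vlm: Callable[[str, str, list[str]], str]) -> float:
    """Return multiple-choice accuracy of a VLM over a list of benchmark items."""
    correct = 0
    for item in items:
        # The model sees the image, the question, and the candidate answers,
        # and must pick the concept the image metonymically evokes.
        prediction = query_vlm(item["image"], item["question"], item["choices"])
        correct += int(prediction == item["answer"])
    return correct / len(items)


if __name__ == "__main__":
    # Toy example of an associative cue (stethoscope) evoking a target concept (doctor).
    toy_items = [{
        "image": "stethoscope.png",
        "question": "Which occupation does this image evoke?",
        "choices": ["doctor", "chef", "pilot", "carpenter"],
        "answer": "doctor",
    }]
    always_doctor = lambda image, question, choices: "doctor"  # stand-in for a real VLM
    print(f"accuracy: {evaluate(toy_items, always_doctor):.1%}")
```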
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 20