Abstract: Figurative language is central to humorous and persuasive communication.
Internet memes, a popular form of multimodal online communication, often use figurative elements to convey layered meaning through the combination of text and images. However, little is known about which elements vision-language models (VLMs) use to detect non-literal meaning in memes.
To address this gap, we evaluate nine state-of-the-art generative VLMs
on their ability to detect and differentiate six types of non-literal meaning in memes.
Our results show that VLMs outperform a majority-vote baseline, and, importantly, their accuracy improves as the figurative complexity of memes increases.
Model performance across figurative categories varies by modality: identifying irony relies on the text, while identifying anthropomorphism relies on the image.
Although VLMs demonstrate competitive performance on single-modality inputs, they fail to fully integrate multimodal content.
We thus highlight both the capabilities and limitations of today's VLMs in figurative meme understanding.
Paper Type: Short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: figurative language, figurative meaning, internet memes, multimodality, vision-language models
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Keywords: figurative language, figurative meaning, internet memes, multimodality, vision-language models
Submission Number: 6427