Abstract: Figurative language is central to humorous and persuasive communication.
Internet memes, a popular form of multimodal online communication, often use figurative elements to convey layered meaning through the combination of text and images. However, little is known about which elements vision-language models (VLMs) use to detect non-literal meaning in memes.
To address this gap, we evaluate nine state-of-the-art generative VLMs
on their ability to detect and differentiate six types of non-literal meaning in memes.
Our results show that VLMs outperform a majority-vote baseline, and, importantly, their accuracy improves as the figurative complexity of memes increases.
Model performance across figurative categories varies by modality: identifying irony relies on the text, while identifying anthropomorphism relies on the image.
Although VLMs demonstrate competitive performance on single-modality inputs, they fail to fully integrate multimodal content.
We thus highlight both the capabilities and limitations of today's VLMs in figurative meme understanding.
Paper Type: Short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: figurative language, figurative meaning, internet memes, multimodality, vision-language models
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Keywords: figurative language, figurative meaning, internet memes, multimodality, vision-language models
Submission Number: 6427