Abstract: Pun memes, which combine wordplay with visual elements, represent a popular form of humor in Chinese online communications.
Despite their prevalence, the ability of current Vision-Language Models (VLMs) to understand and apply these culturally specific multimodal expressions has not been systematically evaluated. In this paper, we introduce PunMemeCN, a novel benchmark designed to assess VLMs' capabilities in processing Chinese pun memes across three progressive tasks: pun meme detection, sentiment analysis, and chat-driven meme response. PunMemeCN consists of 1,959 Chinese memes (653 pun memes and 1,306 non-pun memes) with comprehensive annotations of punchlines, sentiments, and explanations, alongside 2,008 multi-turn chat conversations incorporating these memes. Our experiments indicate that state-of-the-art VLMs struggle with Chinese pun memes, particularly with homophone wordplay, even with Chain-of-Thought prompting. Notably, punchlines in memes can effectively conceal potentially harmful content from AI detection. These findings underscore the challenges in cross-cultural multimodal understanding and highlight the need for culture-specific approaches to humor comprehension in AI systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, corpus creation, evaluation, reproducibility
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: Chinese
Keywords: benchmarking, corpus creation, evaluation, reproducibility
Submission Number: 5494