Ask, Acquire, Understand: A Multimodal Agent-based Framework for Social Abuse Detection in Memes

Published: 29 Jan 2025 (Last Modified: 29 Jan 2025) · WWW 2025 Poster · CC BY 4.0
Track: Web mining and content analysis
Keywords: online harassment, Multimodal, Language and Vision, Social Media, online trust and safety
Abstract: Memes serve as a powerful medium of expression in the digital age, shaping cultural discourse and conveying ideas succinctly and engagingly. However, their potential for social abuse highlights the importance of developing effective methods to detect harmful content within memes. Recent studies have focused on transforming meme images into textual captions using large language models (LLMs), but these approaches often yield uninformative captions. Furthermore, previous methods have been tested only on limited datasets, providing insufficient evidence of their robustness. To address these limitations, we present a multimodal, agent-based framework that generates informative visual descriptions of memes by asking insightful questions in a zero-shot visual question-answering setting. Specifically, we leverage LLMs as agents with distinct roles and a large multimodal model (LMM) as a vision expert. The agents first analyze the image and then pose informative questions about potential social abuse to obtain high-quality answers about the image. Through continuous discussion guided by instructional prompts, the agents accumulate high-quality information while repeatedly querying the LMM for image details, which aids in detecting social abuse. Finally, the discussion history and basic image information are classified by the LLM to produce the final prediction in a zero-shot setting. Experimental results on a meme benchmark sourced from 5 diverse meme datasets, comprising 6,626 memes spanning 5 tasks of varying complexity related to social abuse, demonstrate that our framework outperforms state-of-the-art methods. Detailed comparative analyses and ablation studies further validate its generalizability and its ability to retrieve more relevant information for detecting social abuse in memes.
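The ask-acquire-understand loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the function names (`ask_llm`, `ask_lmm`), the prompts, and the fixed round count are hypothetical assumptions, not the paper's actual implementation.

```python
def detect_abuse(caption: str, ocr_text: str, ask_llm, ask_lmm, rounds: int = 3) -> str:
    """Hypothetical sketch of the agent discussion loop for meme abuse detection.

    ask_llm:  callable taking a text prompt, returning the LLM agent's reply.
    ask_lmm:  callable taking a question, returning the vision expert's answer
              grounded in the meme image (image handling is abstracted away here).
    """
    # Basic information about the meme seeds the discussion history.
    history = [f"Caption: {caption}", f"Text in meme: {ocr_text}"]

    for _ in range(rounds):
        # "Ask": an LLM agent poses an informative question about the image.
        question = ask_llm(
            "Ask one question that would reveal potential social abuse:\n"
            + "\n".join(history)
        )
        # "Acquire": the LMM vision expert answers from the image.
        answer = ask_lmm(question)
        history.append(f"Q: {question}\nA: {answer}")

    # "Understand": zero-shot classification over the full discussion history.
    return ask_llm(
        "Given this discussion, is the meme abusive? "
        "Answer 'abusive' or 'benign':\n" + "\n".join(history)
    )
```

The key design point the abstract emphasizes is that the questions are generated conditioned on the evolving discussion history, so each round can probe deeper than a single static captioning pass.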
Submission Number: 629